First Task (2012):
Cumulative Citation Recommendations (CCR)
Definition: CCR is a type of filtering task that resembles entity linking. Systems must select items from a stream of content and associate them with nodes in a knowledge base (KB).
This differs from traditional entity linking in two ways. First, important details about the entity can change during the timeframe of the stream. Second, in annotating the corpus we are attempting to broaden the definition of correct content to include articles that are pertinent even if they do not explicitly mention the target entity by name, e.g. if an article discusses an artist's work or a CEO's company, then this may contain useful information about the artist or CEO (see discussion of interannotator agreement below).
For the first year (TREC 2012), we are running a basic version of CCR (details below) in which systems simply generate a score between zero and one for each article. In future years, we may consider more complex tasks. The knowledge base is a small selection of nodes from Freebase/Wikipedia primarily from Category:Living_people. We treat each KB node as a routing profile familiar from TREC-3, TREC-11 and others. This combines ideas from TREC Entity and Topic Detection & Tracking.
Before describing specific tasks, let's look at CCR in its most general sense. This has elements of a grand challenge that specific evaluation tasks may not be able to capture for many years. We present this as context for defining the overall goal of KBA.
When scanning a content stream, a variety of heuristics and models can accurately associate content items with target nodes in a KB like Wikipedia. A key aspect of this is coreference resolution of textual mentions to KB entities -- also known as entity linking, which has been explored in TAC Knowledge Base Population (KBP). However, unlike the tasks in KBP, the stream nature of KBA allows the target entity to "move" as accumulated content implies modifications to the attributes, relations, and free text associated with a node. For example, the South African painter Gavin Rain has an exhibit at the Venice Biennale. After his participation was announced, a controversy emerged about the South African pavilion at the exhibit. To detect the pertinence of this controversy to the Gavin Rain KB node, a system must first detect the new relationship implied by the earlier article about his participation.
As a further difference from familiar entity linking, the article about the controversy does not mention Gavin Rain by name. Nonetheless, a good CCR system would suggest it for linking from his KB node.
To give another example of a query, consider the KB node for Carlo Ratti. This person might speak at a conference like TED and present a new project in collaboration with the municipal government of Queensland, Australia. If his TED talk about the new project emerged during the evaluation time range (ETR, defined below), then good CCR systems would track this association and link press releases from the Queensland government about his project -- even if those press releases do not mention Carlo Ratti by name.
See discussion below in Assessor Data about capturing these ideas in annotation guidelines that yield high interannotator agreement.
CCR gets progressively harder as the stream pushes the target around without confirmational feedback from human curators. This reflects real KB evolution, in which human curators tend "hot" nodes while mostly ignoring the long tail of not-yet-on-fire nodes. Fortunately, difficulty increases gradually with duration of accumulation, and we can use this in measuring performance. A longer evaluation time range (ETR) corresponds to nodes that receive less frequent editorial attention. See the time lag plot in the slides presented at the TREC 2011 planning session, which suggests that Wikipedians' reaction time to news events is only slightly better than random. We plan to write up more details about those plots in a brief note.
For more about the general goal of KBA, see the page on future task ideas.
Basic CCR is a filtering task.
Filter Queries: Selected KB nodes are treated as queries, e.g. the entity Paul Rubell could be a query and would be specified by the Freebase ID "/en/paul_rubell" along with snapshots of the WP page http://en.wikipedia.org/wiki/Paul_Rubell and the corresponding Freebase entry http://www.freebase.com/view/en/paul_rubell. The query corpus will be a carefully selected cross section of KB nodes (see discussion below). Details about how we are selecting nodes can be discussed in the Google Group. For the first year, we hope to create such data for approximately fifty nodes.
I/O Streams: Participants' systems get initialized with a query (a KB node) and are then required to iterate over a time-ordered stream of five months of textual news and social media content (see Data below). The stream is divided at the one-fifth mark: the first month is the training time range (TTR), and the remaining four months are the evaluation time range (ETR). The human-generated labels are exposed to participants' systems during the TTR but not during the ETR.
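The label exposure described above can be sketched as a small generator. This is only an illustration under assumptions: the `StreamItem` fields, the `expose_labels` helper, and the example dates are hypothetical and are not part of the task's actual data format.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class StreamItem(NamedTuple):
    """Hypothetical record for one content item in the stream."""
    doc_id: str
    timestamp: datetime
    text: str
    label: Optional[bool]  # human pertinence label; None when hidden


def expose_labels(stream: Iterable[StreamItem],
                  ttr_end: datetime) -> Iterator[StreamItem]:
    """Yield items in stream order, hiding labels once the ETR begins."""
    for item in stream:
        if item.timestamp < ttr_end:
            yield item                       # TTR: label visible for training
        else:
            yield item._replace(label=None)  # ETR: label withheld


# Made-up three-item stream; the TTR is assumed to end on 1 February.
stream = [
    StreamItem("doc-1", datetime(2012, 1, 10), "early article", True),
    StreamItem("doc-2", datetime(2012, 1, 25), "another article", False),
    StreamItem("doc-3", datetime(2012, 2, 3), "later article", True),
]
visible = list(expose_labels(stream, ttr_end=datetime(2012, 2, 1)))
```

A system would consume `visible` once, training while `label` is present and emitting its own scores once it becomes `None`.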
The stream includes labels indicating "pertinence" of each item to the target KB node. For Basic CCR, the threshold for pertinence is minimal: simply, does the content item contain information that pertains directly to the KB node? Given the random time lags observed for most Wikipedia KB nodes, these labels are probably more useful than the stream of edits for that page. However, systems may also use the page's revision history throughout the TTR, but not the ETR. We plan to provide snapshots of the WP edit history for the target nodes and also for in-linking and out-linked nodes.
Participants' systems are permitted to iterate over the content once for each target node. During the beginning of the stream (TTR), systems can train on the labels. During the ETR, the labels are hidden and systems must generate labels. The human-generated labels are Boolean assertions about whether to recommend this content for linking from the target node, and if so, what passage from the item contains the salient info for inclusion in the KB node. We are actively refining the annotation guidelines for assessors and may have richer annotation labels than simply Boolean assertions. At a minimum, the labels will be Boolean:
(
    DocID,        # Identifies item in the stream
    True|False,   # Pertains? (i.e. recommend linking)
    passage       # (optional) Salient info, or None.
)
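As a rough sketch, the label tuple above could be represented in Python as follows. The `Label` and `parse_label` names are hypothetical, and the check that a passage accompanies only pertinent items is our reading of "and if so, what passage", not a stated rule.

```python
from typing import NamedTuple, Optional


class Label(NamedTuple):
    """Hypothetical in-memory form of one assessor label."""
    doc_id: str                    # identifies the item in the stream
    pertains: bool                 # True means "recommend linking"
    passage: Optional[str] = None  # optional salient passage, or None


def parse_label(doc_id: str, pertains: bool,
                passage: Optional[str] = None) -> Label:
    """Build a Label, rejecting a passage on a non-pertinent item (assumed rule)."""
    if passage is not None and not pertains:
        raise ValueError("a salient passage only makes sense when the item pertains")
    return Label(doc_id, pertains, passage)


example = parse_label("doc-7", True, "opened an exhibit")
```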
Instead of a Boolean output, participants' systems must generate a score between zero and one for each DocID. This allows us to generate curves as a function of confidence cutoff, e.g. ROC and F1 plots.
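To illustrate how a confidence cutoff turns scores into a curve, here is a minimal sketch that computes precision, recall, and F1 at each cutoff. The function names and the toy scores are assumptions made for illustration; this is not the official scoring tool.

```python
from typing import Dict, Iterable, List, Tuple


def f1_at_cutoff(scores: Dict[str, float], truth: Dict[str, bool],
                 cutoff: float) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) when items scoring >= cutoff are recommended."""
    tp = fp = fn = 0
    for doc_id, is_pertinent in truth.items():
        predicted = scores.get(doc_id, 0.0) >= cutoff
        if predicted and is_pertinent:
            tp += 1
        elif predicted:
            fp += 1
        elif is_pertinent:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_curve(scores: Dict[str, float], truth: Dict[str, bool],
             cutoffs: Iterable[float]) -> List[Tuple[float, float]]:
    """Return (cutoff, F1) pairs, one per cutoff."""
    return [(c, f1_at_cutoff(scores, truth, c)[2]) for c in cutoffs]


# Toy data: four items, two of them truly pertinent.
scores = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
truth = {"a": True, "b": False, "c": True, "d": False}
curve = f1_curve(scores, truth, [0.3, 0.5, 0.95])
```

In this toy data, lowering the cutoff from 0.5 to 0.3 picks up item "c" and raises F1 from 0.5 to 0.8, while a cutoff of 0.95 recommends nothing and yields an F1 of zero.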
Evaluation: We will use simple and familiar metrics, like F1. We anticipate that many systems will perform well for short ETR and then eventually lose track of the target. How long a system can maintain an F-measure above some threshold is an indication of quality. Since ETR has a meaningful unit of measurement (time), we may find ways to integrate ETR duration into metrics for future tasks.
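One way to express "how long a system can maintain an F-measure above some threshold" is to bucket the ETR into periods and count the leading periods that stay on target. The sketch below assumes weekly buckets and a hypothetical `periods_on_target` helper; the actual metric, if one is adopted, may differ.

```python
from typing import List, Sequence, Tuple


def f1(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 for one period, from parallel lists of predictions and true labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def periods_on_target(buckets: List[Tuple[Sequence[bool], Sequence[bool]]],
                      min_f1: float) -> int:
    """Count leading ETR periods whose F1 stays at or above min_f1."""
    count = 0
    for predicted, actual in buckets:
        if f1(predicted, actual) < min_f1:
            break  # the system has lost track of the target
        count += 1
    return count


# Made-up weekly buckets of (predicted, actual) labels.
weeks = [
    ([True, True, False], [True, True, False]),   # F1 = 1.0
    ([True, False, True], [True, True, False]),   # F1 = 0.5
    ([False, False, True], [True, True, False]),  # F1 = 0.0
]
weeks_tracked = periods_on_target(weeks, min_f1=0.5)
```

Here the system stays at or above an F1 of 0.5 for the first two weeks and then loses the target, so `weeks_tracked` is 2.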
Sensitivity Problem: In Basic CCR, there is no novelty requirement, so if Justin Bieber produces a new album, and two hundred content items announce it, then in principle they are all correct choices -- they all contain information that pertains to that KB node. Also, there is no salience requirement, so if a content item clearly refers to or pertains to the target entity -- even only tangentially -- then linking it is correct. This probably allows trivial algorithms to get many correct answers, which will cause both simple and sophisticated systems to receive very similar scores. While we would like to differentiate sophisticated systems, we must keep the first year's annotation task as simple as possible.
However, annotating a large content stream is already challenging. We are exploring techniques for simplifying the annotation task, and they may improve the sensitivity of Basic CCR. For example, we are aggressively de-duplicating similar articles from the input stream. While this introduces bias, it simultaneously simplifies annotation and improves sensitivity.
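The de-duplication method used for the corpus is not specified here. As one possible illustration, the sketch below drops near-duplicates by comparing word shingles with Jaccard similarity; the shingle size, the 0.8 threshold, and all function names are assumptions.

```python
from typing import Iterable, List, Set, Tuple

Shingles = Set[Tuple[str, ...]]


def shingles(text: str, k: int = 3) -> Shingles:
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Shingles, b: Shingles) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(items: Iterable[Tuple[str, str]],
                threshold: float = 0.8) -> List[str]:
    """Keep the first of each group of near-duplicate (doc_id, text) items."""
    kept: List[Tuple[str, Shingles]] = []
    for doc_id, text in items:
        sig = shingles(text)
        # Quadratic comparison: fine for a sketch, too slow for a full stream.
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc_id, sig))
    return [doc_id for doc_id, _ in kept]


items = [
    ("a", "Justin Bieber releases a new album this week"),
    ("b", "Justin Bieber releases a new album this week"),
    ("c", "Local council approves funding for a new bridge"),
]
unique_ids = deduplicate(items)
```

With these three made-up items, the second is an exact copy of the first and is dropped, leaving "a" and "c". A stream of this size would need a sublinear technique such as hashing-based signatures rather than pairwise comparison.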
In creating assessor data (see Assessor Data below), we are using several filtering tools. We plan to document at least one of these as a baseline reference system. If such a system has high precision and non-trivial recall, then its output may be provided as a label throughout both TTR and ETR. Measuring systems against such a baseline would focus the evaluation on recall gains.
We are handpicking a cross section of nodes from specific categories in the intersection of Wikipedia (WP) and Freebase (FB). Category:Living_people alone has about 500k nodes. In selecting KB nodes for the first year's query corpus, we seek a cross section of nodes that exhibit a variety of lag times and that experience activity during the timeframe of the content stream. See the edit-versus-mention interval plots in the slides presented at the TREC 2011 planning session.
Potential participants have expressed a desire to avoid tabloid characters, like athletes and movie stars. For several reasons, we are focusing on lesser known entities, such as artists, criminals, local-level politicians, and interesting groups. If you have a node that you think would be interesting, please suggest it in the discussion forum.
We have been assembling a corpus of news and social media content. While the stream is more than half non-English, for TREC 2012 we are focusing on the English substream. After de-duplication, this substream contains fewer than 500,000 content items per day.
Several factors in the collection process bias this corpus. However, the goal of this collection is to provide an information stream for systems to process chronologically.
We are actively seeking more time-ordered data, e.g. we could include earthquake and weather data. If you have ideas, let us know by posting to the Google Group.
In the terminology of Wikipedia, a link could be a citation, external link, or other form of hyperlink to a page outside the KB. Wikipedians create links to external content for many reasons. Some of these reasons are repeatable in the sense that annotators with modest training would agree on which links to create. In designing the annotation procedure for KBA, we seek to capture these repeatable reasons for linking in simple guidelines that assessors at NIST and elsewhere can easily apply. We expect to get beyond familiar entity linking to include content that does not explicitly mention the entity's name. However, the need for reasonably high interannotator agreement will limit how far we can go.
In future years, it may prove useful to manually label Wikipedia edit history for a task. However, for Basic CCR, assessor data is simply node ID labels on the corpus. This is very similar to the TREC 2002 Filtering test data, and we are building on NIST's experience in building evaluation data sets for filtering and large corpora [1, 2].
Similar to TREC 2002 Filtering, we plan to assemble a set of systems to adaptively filter the stream for the assessors. After we run the evaluation for the first year, we can sample the output of participants' systems to measure coverage of the evaluation data.
We are interested in collaborating with the TREC Crowdsourcing track as a means of producing assessor labels in future years.
By providing training data for a new task where there are no existing systems, we may bias the models such that certain kinds of correct associations are missed. While we are working to make comprehensive assessments for the target nodes, the training data represents user feedback and will not be complete. Participants' systems will have to deal with that.
We will post more information about the annotation tools and guidelines to the Google Group.
1. Ian Soboroff, Ellen Voorhees, and Nick Craswell (September 2003). Summary of the SIGIR 2003 workshop on defining evaluation methodologies for terabyte-scale test collections. SIGIR Forum, vol. 37, no. 2 (Fall 2003).
2. Ian Soboroff and Stephen Robertson (July 2003). Building a Filtering Test Collection for TREC 2002. Proceedings of the 26th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2003), Toronto, Ontario, Canada.