TREC 2016 Total Recall Track

Guidelines:  May 31, 2016

Task Overview

The TREC 2016 Total Recall Track shares the same objectives and overall architecture as the TREC 2015 Total Recall Track.  Participants should familiarize themselves with the TREC 2015 Total Recall Track Overview, currently available to teams who have registered with NIST to participate in TREC 2016, using the login credentials for "active participants" supplied by NIST.

The overall task is unchanged from 2015:  Track participants will implement automatic or semi-automatic methods to identify as many relevant documents as possible, with as little review effort as possible, from document collections containing as many as 2.2 million documents.  Participating systems will be run against an automated assessment server, which is unchanged from 2015 (except for the addition of new datasets).  An open-source (GPL) baseline model implementation (BMI) of an automated participant system is available for download.  Participants may use or modify BMI, or may implement their own system -- automated or manual -- from scratch.

Participants familiar with the 2015 guidelines and supplemental guidelines will find the following differences from the 2015 Track:

Task Operation

The document collection, information need (topic), and an automated relevance assessor will be supplied to participants via an on-line server. After downloading the collection and information need, participants must identify documents from the collection and submit them (in batches whose size is determined by the participant) to the on-line relevance assessor.  Every document submitted to the assessor is scored, and the primary assessment of relevance is returned immediately to the participant, for each document in each batch, as it is submitted.  To accomplish this, the Total Recall coordinators are using collections in which every document has been pre-labeled as relevant or not and the automated assessor merely provides that label to the participant.

Participants have two objectives:

  1. To submit as many documents containing relevant information as possible, while submitting as few documents as possible, to the automated relevance assessor.  Submission continues indefinitely, and is evaluated in terms of how many relevant documents are found, as a function of the number of documents submitted.
  2. To "call their shot" to indicate, without actually stopping, the point at which it would be reasonable to stop, because the effort to review more documents would be disproportionate to the value of any further relevant documents that might be found.

Motivating Applications

The Total Recall Track addresses the needs of searchers who want to find out everything about X, for some X. Typical examples include:

Evaluation Measures

There are many possible definitions for "as many documents containing relevant information as possible" and "as few documents as possible." The Total Recall Track will report traditional as well as novel measures to weigh the tradeoff between information found and effort expended.

Rank-based measures will include recall-precision curves, gain curves, and recall evaluated at aR+b documents submitted, for all combinations of a = {1, 2, 4} and b = {0, 100, 1000}, where R is the number of relevant documents in the collection for the topic.
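
For concreteness, the shell sketch below shows one way such recall@aR+b values could be computed from a submission log.  It is not the official evaluation tool: the file names run.ranked (submitted docids, in submission order) and qrels.relevant (all relevant docids for the topic) are assumptions made for illustration.

  # Hypothetical sketch, not the official evaluation code: compute recall at
  # each aR+b cutoff.  run.ranked lists submitted docids in submission order;
  # qrels.relevant lists all relevant docids (both file names are assumed).
  R=$(wc -l < qrels.relevant)
  for a in 1 2 4; do
    for b in 0 100 1000; do
      CUTOFF=$((a * R + b))
      FOUND=$(head -n "$CUTOFF" run.ranked | sort -u | join - <(sort qrels.relevant) | wc -l)
      echo "recall@${a}R+${b} = $FOUND/$R"
    done
  done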

Set-based measures, evaluated at the point at which "call your shot" is indicated, will include Recall and Precision, as well as aggregate measures such as F1 and other utility measures (to be announced) that balance recall with review effort.
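
As a reminder of how the aggregate measure is formed, the fragment below computes Precision, Recall, and F1 at the call-your-shot point.  The variable names are assumptions made for illustration: SUBMITTED is the number of documents submitted up to that point, FOUND the number of relevant documents among them, and R the total number of relevant documents for the topic.

  # Illustrative only: Precision, Recall, and F1 at the call-your-shot point.
  # SUBMITTED, FOUND, and R are assumed to be set as described above.
  PREC=$(echo "scale=4; $FOUND / $SUBMITTED" | bc)
  REC=$(echo "scale=4; $FOUND / $R" | bc)
  F1=$(echo "scale=4; 2 * $PREC * $REC / ($PREC + $REC)" | bc)
  echo "Precision=$PREC Recall=$REC F1=$F1"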

Measures taking into account document importance, sub-topic coverage, and alternate relevance assessments will also be computed.

Summary recall/precision/effort results will be available to participants at the end of the evaluation period; detailed results will be presented to participants at TREC in November, and to the public in the TREC proceedings, to be published in early 2017.

"At Home" vs. "Sandbox" Evaluation

Practice collections and topics are available now to all registered TREC 2016 participants. A baseline model implementation (BMI) of a fully automated approach is available now for experimental purposes.  TREC participants may access the test collections immediately, via the server, using their registered TREC Group Identifier; participants may also access the TREC 2015 collections athome1, athome2, and athome3 by submitting the necessary data access agreements (see below).

For the 2016 "At Home" task, one new collection, athome4, will be available to participants via the Internet, subject to the execution of the TREC 2016 Total Recall usage agreement.  For this collection, participants will run their own systems, and access the automated assessor via the Internet.  No prior experimentation or practice on athome4 is permitted; all runs will be logged and reported in the TREC 2016 proceedings.

Participants must declare each run to be either "automatic," meaning that no manual intervention was used once the collection was downloaded, or "manual," meaning that manual intervention -- whether parameter tweaking, searching, or full-scale document review -- was involved.  If multiple runs are conducted, every run must be independent; under no circumstances may information learned from one run be used in any other. If documents are manually reviewed, the same documents must also be submitted to the assessment server, at the time they are reviewed.  At Home participants will be required to complete a short questionnaire describing the nature and quantity of the manual effort involved in each run.

For the "Sandbox" task, the server for the two collections will be available only within a firewalled platform with no Internet access.  Participants wishing to evaluate their systems on these datasets must submit a fully automated solution, which the Track coordinators will execute as a virtual machine within a restricted environment.

The baseline model implementation (BMI) supplied by the 2015 Total Recall Track is suitable for "Sandbox" as well as automatic "At Home" participation, and participants are free to modify it as they see fit, subject to the GNU General Public License (GPL v3).

Participants may submit their own virtual machine, perhaps containing proprietary software.  In this case, participants must warrant that they have the right to use the software in this way, and the Track coordinators will in turn warrant that the submission will be used only for the purpose of evaluation within the sandbox.

Potential Strategies

Anticipated Logistics and Timeline

Details: Automatic At-Home Participation

Each participant may conduct up to six automatic experiments, full or limited, with each experiment applying a particular fully automated method to the athome4 (or athome4subset) test collection.  Participants should use a meaningful name (of their own choosing) for each experiment, and enter that name as the "RUNNAME" in the Baseline Model Implementation (BMI) configuration file, as the ":alias" parameter in the API, or as "Run Alias" when using the manual Web interface.
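
For example, a BMI configuration file for one such experiment might contain a line like the following (the run name is illustrative, and all other required settings are omitted):

  # Illustrative fragment of a BMI configuration file: only the RUNNAME line is
  # shown, with an example value of the participant's choosing.
  RUNNAME=myGroup-athome4-run1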

NOTE:  Once an athome test run is created, its results will become part of the official TREC 2016 record.  It is not possible to start over or to expunge a run.

Automatic experiments may interact with the assessment server either directly using its API, or using the code provided in the BMI.
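
The only API route quoted in these guidelines is the call-your-shot POST shown in the next section; the line below merely illustrates the general shape of a direct interaction using curl, with a hypothetical /judge/... path for submitting docids.  Consult the assessment server documentation (or the BMI source) for the actual routes and response format.

  # Hypothetical illustration only -- this is NOT the documented route; see the
  # assessment server documentation or the BMI source for the real API.  The
  # general pattern is an HTTP request naming the group ($LOGIN), the topic, and
  # the docids being submitted; the server replies with a relevance label per docid.
  curl -X POST "$TRSERVER/judge/$LOGIN/$TOPIC/docid0001:docid0002"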

Participants must certify that, for each automatic experiment:

  1. No modification or configuration of the participant's system was done after a run (either automatic or manual) was created for any athome test; and
  2. No modification or configuration of the participant's system was done after any member of the participant group became aware of the topics or contents of documents used for either the Total Recall or Dynamic Domain tasks.

In other words, automatic experiments must use software that, without human intervention, downloads the dataset and conducts the task end to end.

Details: Call your Shot

If you download BMI on or after May 24, 2016, it will include the default "call your shot" rule, whose implementation is shown below.  You may modify the call-your-shot rule (and any other aspect of BMI) as you wish.

To implement the default rule for "call your shot" in a version of BMI downloaded prior to May 24, 2016, please modify the BMI implementation as follows:

  # NDUN counts the documents submitted so far; L is the current batch size,
  # which grows by roughly 10% per iteration.
  NDUN=$((NDUN+L))
  L=$((L+(L+9)/10))
  # Call the shot when the number of submitted documents not listed in
  # prel.$TOPIC reaches 1000 plus half the number of lines in prel.$TOPIC
  # (skipped if the shot has already been called for this topic).
  if [ "$TOPIC" != "$REASONABLE" -a "`sort new*.$TOPIC | join -v1 - prel.$TOPIC | wc -l`" -ge $((1000+`cat prel.$TOPIC | wc -l`/2)) ] ; then
    curl -X POST "$TRSERVER/judge/shot/$LOGIN/$TOPIC/reasonable"
    echo "Called shot REASONABLE for topic $TOPIC at $NDUN" >> $LOG.$LOGIN
    REASONABLE="$TOPIC"
  fi

Details: Manual At-Home Participation

Each group may conduct one Manual At-Home experiment (whether or not they also conduct Automatic At-Home experiments).  Participants conducting both manual and automatic experiments must ensure that the software to conduct their automatic experiments is frozen prior to creating any manual run.

Participants are required to track the nature and quantity of any manual effort, and to submit this information before the end of the At-Home phase.

The coordinators envision that manual participants may engage in some or all of the following activities:

  1. Dataset-specific processing, formatting, or indexing;
  2. Topic-specific searching within the dataset;
  3. Consultation of external resources such as the Web, or individuals familiar with the subject matter of the topics;
  4. Manual review of documents.

Participants are required to report the nature of these activities, to estimate the number of hours spent, on average, per topic, and to report the number of documents reviewed, per topic.  Participants are required to submit all manually reviewed documents to the assessment server, so that they may be accounted for as "review effort."

NOTE: Manual participants, whether or not they manually review documents, may still avail themselves of assessments through the assessment server, using the TREC-supplied "Manual" interface, or using the API or BMI.

At-Home Usage Agreements, and GroupID activation

Each participant will be assigned an extended Group ID, which must be activated in order to conduct At-Home experiments.  The GroupID will have the form GGG.XXXX where GGG is the GroupID used for practice, and XXXX is a randomly generated suffix.

To gain access to athome1 (for testing purposes) and athome4 or athome4subset (for submission), participants must sign the "TREC Total Recall Usage Agreement" and return a pdf of the signed agreement to the TREC Total Recall coordinators.

To gain access to athome2 and athome3 (for testing purposes), participants must submit the "TREC Dynamic Domain Usage Agreement" to NIST and forward the email confirming NIST's acceptance of that agreement to the Total Recall coordinators.

NOTE:  Participants do not need to download the Dynamic Domain datasets to participate in the Total Recall 2016 Track; but if they want to use them for testing purposes, they need to obtain permission.

Participant Questionnaire

For each experiment, participants will be required to respond to a questionnaire containing questions such as the following:

  1. Assigned TREC Group ID
  2. Name of experiment [Each experiment should have a separate name, which should be used consistently as "RUNNAME" for BMI users, ":alias" for API users, or "Run Alias" for manual Web interface users.  A participating group may submit up to six automatic experiments and one manual experiment.]
  3. Manual or Automatic? 
  4. Please give a brief description of the hypothesis and methods employed in this experiment, with particular emphasis on how it differs from other experiments for which you are submitting results.

Sandbox Datasets and Topics

Sandbox submissions will be run by the TREC Total Recall coordinators (or their delegates) on private datasets.  One of the datasets that will be used consists of 2.2M email messages from the administrations of two senior elected officials, which have previously been classified according to six topics of interest, not unlike the athome4 collections.

The second dataset will consist of 800,000 Twitter "tweets," classified according to four topics of interest.

Further details on Sandbox submission requirements will be available prior to the Sandbox submission deadline of September 7, 2016.

Unofficial Runs

Once participants have completed their experiments, there will be a facility for them to download a log of their submissions, as well as the official relevance assessments.  Tools that compute various summary evaluation results will be provided.  Participants may use this information to conduct unofficial experiments exploring "what if?" scenarios.

The TREC 2016 Workshop

Participants who conduct at least one experiment (At-Home automatic, At-Home manual, or Sandbox) are eligible to attend the TREC 2016 workshop in November, to have a paper included in the TREC 2016 workbook, and to have a paper included in the final TREC 2016 proceedings.  Participants may also present a poster at TREC, and may be invited to speak.