The document collection, information need, and an automated
relevance assessor will be supplied to participants
via an on-line server. After downloading the collection and
information need, participants must identify documents from
the collection and submit them (in batches whose size is
determined
by the participant) to the on-line relevance assessor. Every
document submitted to the assessor is scored, and an authoritative
relevance assessment is returned to the participant immediately, for
each document in each batch, as it is submitted. To accomplish this,
the Total Recall coordinators will use collections in which every
document has been pre-labeled as relevant or not relevant; the
automated assessor simply returns that label to the participant.
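To illustrate the interaction, the following Python sketch shows one way a participant's system might submit batches and record the labels returned by the assessor. The endpoint URL, payload, and response format are illustrative placeholders only, not the Track's actual interface, which is specified by the coordinators.

    # Minimal sketch of the submit-and-assess loop. The URL and JSON layout
    # below are assumptions for illustration, not the Track's real interface.
    import requests

    ASSESSOR_URL = "http://example.org/assess"  # placeholder server

    def submit_batch(run_id, doc_ids):
        """Submit one batch of document ids; return {doc_id: is_relevant}."""
        resp = requests.post(ASSESSOR_URL, json={"run": run_id, "docs": doc_ids})
        resp.raise_for_status()
        # Assumed response shape: {"judgments": {"doc001": true, ...}}
        return resp.json()["judgments"]

    judgments = {}
    # Batch size is chosen by the participant.
    for batch in [["doc001", "doc002"], ["doc003"]]:
        judgments.update(submit_batch("myrun", batch))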
The objective is to submit to the automated relevance assessor as many of the documents containing relevant information as possible, while submitting as few documents overall as possible.
Traditional measures will include those based on Recall and Precision, such as Recall-Precision curves, Average Precision, R-Precision, and F1.
Additional measures will include gain curves and associated measures, which track Recall as a function of effort, where effort is defined as the number of documents submitted to the automated relevance assessor.
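As a rough illustration of these effort-based measures, the sketch below computes recall, precision, and F1 after each submitted document, given the labels returned in submission order and the total number of relevant documents for the topic (which the server, not the participant, knows):

    # Gain-curve sketch: recall, precision, and F1 as a function of effort,
    # where effort is the number of documents submitted so far.
    def gain_curve(labels, total_relevant):
        """labels: booleans in submission order; total_relevant: R for the topic."""
        curve = []
        found = 0
        for effort, rel in enumerate(labels, start=1):
            found += rel
            recall = found / total_relevant
            precision = found / effort
            f1 = 2 * precision * recall / (precision + recall) if found else 0.0
            curve.append((effort, recall, precision, f1))
        return curve

    # Example: gain_curve([True, True, False, True], total_relevant=10)
    # yields recall 0.1, 0.2, 0.2, 0.3 at efforts 1 through 4.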
A new "facet-oriented" Recall measure will be introduced, to measure how effectively participants are able to achieve high recall over sub-topics and sub-collections.
Participants will also be evaluated on how well they are able to estimate the ongoing effectiveness of their runs. For example, participants will have the opportunity to specify, at some point between submissions of batches of documents to the automated assessor, that they believe they have achieved a specified goal, e.g., the optimal value for a measure such as F1.
The measures actually achieved will be calculated by the on-line server. Summary recall/precision/effort results will be available to participants once their runs are complete, but detailed results will be presented to participants at TREC in November, and to the public in the TREC proceedings, to be published in early 2016.
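For instance, a hindsight-optimal F1 point can be read off the gain curve sketched above; comparing the effort at which a participant declared "optimal F1" with the effort at which F1 actually peaked indicates how well the run's effectiveness was estimated. This is an illustration only; the on-line server's scoring is authoritative.

    # Sketch: locate the point along the gain curve where F1 peaks in
    # hindsight, for comparison with a participant's declared stopping point.
    def best_f1_point(curve):
        """curve: list of (effort, recall, precision, f1) tuples."""
        return max(curve, key=lambda point: point[3])

    # best_f1_point(gain_curve(labels, total_relevant)) returns the
    # (effort, recall, precision, f1) tuple at which F1 was highest.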
The server for some of the collections will be available to participants via the Internet, subject to the execution of a usage agreement. For these collections, participants will run their own systems and access the automated assessor via the Internet ("Play-at-Home" participation). No prior experimentation or practice on these collections is permitted; all runs will be logged and reported in the TREC 2015 proceedings.
Participants must declare each run to be
either "automatic," meaning that no manual intervention
was used once the collection was downloaded, or "manual,"
meaning that manual intervention -- whether parameter tweaking
or full-scale document review -- was involved.
If multiple runs are conducted, every run must be independent;
under no circumstances may information learned from one run be
used in any other. Play-at-Home participants will be required to
complete a short questionnaire describing the nature and quantity
of manual effort involved in each run.
To preserve the confidentiality of sensitive information, the server for some of the collections will be available only within a firewalled platform with no Internet access. Participants wishing to evaluate their systems on these datasets must submit a fully automated solution, which the Track coordinators will execute as a virtual machine within a restricted environment ("Sandbox" participation).
The baseline model implementation ("BMI") supplied by the 2015 Total Recall Track is suitable for "Sandbox" as well as automatic "Play-at-Home" participation, and participants are free to modify it as they see fit, subject to the GNU General Public License, version 3 (GPLv3).
Participants may submit their own virtual machine, perhaps containing proprietary software. In this case, participants must warrant that they have the right to use the software in this way, and the Track coordinators will in turn warrant that the submission will be used only for the purpose of evaluation within the sandbox.