TREC 2015 Total Recall Track

Initial Draft Guidelines:  May 3, 2015

Task Overview

Track participants will implement automatic or semi-automatic methods to identify as many relevant documents as possible, with as little review effort as possible, from document collections containing as many as 1 million documents.

The document collection, information need, and an automated relevance assessor will be supplied to participants via an on-line server. After downloading the collection and information need, participants must identify documents from the collection and submit them (in batches whose size is determined by the participant) to the on-line relevance assessor. Every document submitted to the assessor is scored, and an authoritative relevance assessment is returned to the participant immediately, for each document in each batch, as it is submitted. To accomplish this, the Total Recall coordinators are using collections in which every document has been pre-labeled as relevant or not; the automated assessor simply returns that label to the participant.
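The exact server interface will be documented separately; purely as an illustration of the submit-and-assess loop described above, a participant's client might be structured along the lines of the following Python sketch. The endpoint path, server address, topic identifier, batching strategy, and stopping rule shown here are all placeholders and assumptions, not the track's actual API.

    import requests

    SERVER = "http://example.org"   # placeholder; the real address accompanies the usage agreement
    TOPIC = "topic001"              # placeholder topic identifier

    def submit_batch(doc_ids):
        """Submit one batch of document ids; return {doc_id: True/False}.
        The endpoint name and response format are assumptions for illustration."""
        resp = requests.post(SERVER + "/judge/" + TOPIC, json={"docs": doc_ids})
        resp.raise_for_status()
        return {j["id"]: j["relevant"] for j in resp.json()["judgments"]}

    def run(ranked_doc_ids, batch_size=100, patience=5):
        """Submit documents in fixed-size batches, stopping after `patience`
        consecutive batches that yield no relevant documents (a naive rule)."""
        judgments, dry = {}, 0
        for start in range(0, len(ranked_doc_ids), batch_size):
            labels = submit_batch(ranked_doc_ids[start:start + batch_size])
            judgments.update(labels)
            dry = 0 if any(labels.values()) else dry + 1
            if dry >= patience:
                break
        return judgments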

The objective is to submit as many documents containing relevant information as possible, while submitting as few documents as possible, to the automated relevance assessor.

Motivating Applications

The Total Recall Track addresses the needs of searchers who want to find out everything about X, for some X. Typical examples include:

Evaluation Measures

There are many possible definitions for "as many documents containing relevant information as possible" and "as few documents as possible." The Total Recall Track will report traditional as well as novel measures to weigh the tradeoff between information found and effort expended.

Traditional measures will include those based on Recall and Precision, such as Recall-Precision curves, Average Precision, R-Precision, and F1.
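For concreteness, the sketch below shows how these measures reduce to simple computations over the ordered list of documents submitted to the assessor, assuming (as the evaluation server does) knowledge of the complete set of relevant document ids. Function names are illustrative only.

    def precision_recall_f1(submitted, relevant):
        """`submitted` is the ordered list of document ids sent to the assessor;
        `relevant` is the set of all relevant document ids in the collection."""
        found = sum(1 for d in submitted if d in relevant)
        precision = found / len(submitted) if submitted else 0.0
        recall = found / len(relevant) if relevant else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    def average_precision(submitted, relevant):
        """Mean of the precision at each rank where a relevant document appears."""
        hits, total = 0, 0.0
        for rank, d in enumerate(submitted, start=1):
            if d in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    def r_precision(submitted, relevant):
        """Precision after exactly R documents have been submitted, R = |relevant|."""
        r = len(relevant)
        head = submitted[:r]
        return sum(1 for d in head if d in relevant) / r if r else 0.0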

Additional measures will include gain curves and associated measures, which track Recall as a function of effort, where effort is defined as the number of documents submitted to the automated relevance assessor.
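A gain curve can be computed from the same information; the following sketch records recall after each document is submitted, using the same illustrative conventions as above.

    def gain_curve(submitted, relevant):
        """Recall as a function of effort: entry i is the fraction of all relevant
        documents found after the first i+1 submitted documents."""
        found, curve = 0, []
        for d in submitted:
            found += d in relevant
            curve.append(found / len(relevant) if relevant else 0.0)
        return curve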

A new "facet-oriented" Recall measure will be introduced, to measure how effectively participants are able to achieve high recall over sub-topics and sub-collections.
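The precise definition of this measure will be announced by the coordinators. Purely as an illustration of the idea, recall might be tracked separately within each facet, as in the sketch below; the facet partition and function name are assumptions, not the track's definition.

    def per_facet_recall(submitted, relevant_by_facet):
        """Illustration only: `relevant_by_facet` maps each facet (sub-topic or
        sub-collection) to its set of relevant document ids; the result gives
        the recall achieved within each facet."""
        found = set(submitted)
        return {facet: len(found & docs) / len(docs)
                for facet, docs in relevant_by_facet.items() if docs}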

Participants will also be evaluated on how well they are able to estimate the ongoing effectiveness of their runs. For example, participants will have the opportunity to specify, at some point between submissions of batches of documents to the automated assessor, that they believe they have achieved a specified goal, e.g., the optimal value for a measure such as F1.
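As one simple illustration of such an estimate (not a prescribed method): the assessor's responses reveal exactly how many relevant documents have been found so far, so a participant who also maintains an estimate of the total number of relevant documents in the collection can estimate recall, and hence F1, at any point in the run.

    def estimated_f1(labels, estimated_total_relevant):
        """Illustration only: `labels` maps each submitted document id to the
        relevance label returned by the assessor; `estimated_total_relevant`
        is the participant's own estimate of how many relevant documents exist."""
        found = sum(1 for rel in labels.values() if rel)
        precision = found / len(labels) if labels else 0.0
        recall = found / estimated_total_relevant if estimated_total_relevant else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0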

The measures actually achieved will be calculated by the on-line server. Summary recall/precision/effort results will be available to participants once their runs are complete, but detailed results will be presented to participants at TREC in November, and to the public in the TREC proceedings, to be published in early 2016.

"Play-at-Home" vs. "Sandbox" Evaluation

Practice collections and topics are available now to all registered TREC 2015 participants. A baseline model implementation ("BMI") of a fully automated approach is available now for experimental purposes.

The server for some of the collections will be available to participants via the Internet, subject to the execution of a usage agreement. For these collections, participants will run their own systems and access the automated assessor via the Internet ("Play-at-Home" participation). No prior experimentation or practice on these collections is permitted; all runs will be logged and reported in the TREC 2015 proceedings.

Participants must declare each run to be either "automatic," meaning that no manual intervention was used once the collection was downloaded, or "manual," meaning that manual intervention -- whether parameter tweaking or full-scale document review -- was involved. If multiple runs are conducted, every run must be independent; under no circumstances may information learned from one run be used in any other. Play-at-Home participants will be required to complete a short questionnaire describing the nature and quantity of manual effort involved in each run.

To preserve the confidentiality of sensitive information, the server for some of the collections will be available only within a firewalled platform with no Internet access. Participants wishing to evaluate their systems on these datasets must submit a fully automated solution, which the Track coordinators will execute as a virtual machine within a restricted environment ("Sandbox" participation).

The baseline model implementation ("BMI") supplied by the 2015 Total Recall Track is suitable for "Sandbox" as well as automatic "Play-at-Home" participation, and participants are free to modify it as they see fit, subject to the GNU General Public License (GPL v3).

Participants may submit their own virtual machine, perhaps containing proprietary software. In this case, participants must warrant that they have the right to use the software in this way, and the Track coordinators will in turn warrant that the submission will be used only for the purpose of evaluation within the sandbox.

Possible Strategies

Anticipated Logistics and Timeline