The document collection, information need, and an automated
relevance assessor will be supplied to participants
via an on-line server. After downloading the collection and
information need, participants must identify documents from
the collection and submit them (in batches whose size is
determined
by the participant) to the on-line relevance assessor. Every
document submitted to the assessor is scored, and an authoritative
relevance assessment is returned to the participant immediately, for
each document in each batch, as it is submitted. To accomplish this,
the Total Recall coordinators will use collections in which every
document has been pre-labeled as relevant or not relevant; the
automated assessor simply returns that label to the participant.
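To illustrate the interaction, the following Python sketch shows one way a participant's system might submit batches and record the labels returned by the assessor. The endpoint URL, payload, and response format are illustrative placeholders only, not the Track's actual interface, which is specified by the coordinators.

    # Minimal sketch of the submit-and-assess loop. The URL and JSON layout
    # below are assumptions for illustration, not the Track's real interface.
    import requests

    ASSESSOR_URL = "http://example.org/assess"  # placeholder server

    def submit_batch(run_id, doc_ids):
        """Submit one batch of document ids; return {doc_id: is_relevant}."""
        resp = requests.post(ASSESSOR_URL, json={"run": run_id, "docs": doc_ids})
        resp.raise_for_status()
        # Assumed response shape: {"judgments": {"doc001": true, ...}}
        return resp.json()["judgments"]

    judgments = {}
    # Batch size is chosen by the participant.
    for batch in [["doc001", "doc002"], ["doc003"]]:
        judgments.update(submit_batch("myrun", batch))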
The objective is to submit to the automated relevance assessor as many of the documents containing relevant information as possible, while submitting as few documents overall as possible.
Traditional measures will include those based on Recall and Precision, such as Recall-Precision curves, Average Precision, R-Precision, and F1.
Additional measures will include gain curves and associated measures, which track Recall as a function of effort, where effort is defined as the number of documents submitted to the automated relevance assessor.
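As a rough illustration of these effort-based measures, the sketch below computes recall, precision, and F1 after each submitted document, given the labels returned in submission order and the total number of relevant documents for the topic (which the server, not the participant, knows):

    # Gain-curve sketch: recall, precision, and F1 as a function of effort,
    # where effort is the number of documents submitted so far.
    def gain_curve(labels, total_relevant):
        """labels: booleans in submission order; total_relevant: R for the topic."""
        curve = []
        found = 0
        for effort, rel in enumerate(labels, start=1):
            found += rel
            recall = found / total_relevant
            precision = found / effort
            f1 = 2 * precision * recall / (precision + recall) if found else 0.0
            curve.append((effort, recall, precision, f1))
        return curve

    # Example: gain_curve([True, True, False, True], total_relevant=10)
    # yields recall 0.1, 0.2, 0.2, 0.3 at efforts 1 through 4.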
A new "facet-oriented" Recall measure will be introduced, to measure how effectively participants are able to achieve high recall over sub-topics and sub-collections.
Participants will also be evaluated on how well they are able to estimate the ongoing effectiveness of their runs. For example, participants will have the opportunity to specify, at some point between submissions of batches of documents to the automated assessor, that they believe they have achieved a specified goal, e.g., the optimal value for a measure such as F1.
The measures actually achieved will be calculated by the on-line server. Summary recall/precision/effort results will be available to participants once their runs are complete, but detailed results will be presented to participants at TREC in November, and to the public in the TREC proceedings, to be published in early 2016.
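For instance, a hindsight-optimal F1 point can be read off the gain curve sketched above; comparing the effort at which a participant declared "optimal F1" with the effort at which F1 actually peaked indicates how well the run's effectiveness was estimated. This is an illustration only; the on-line server's scoring is authoritative.

    # Sketch: locate the point along the gain curve where F1 peaks in
    # hindsight, for comparison with a participant's declared stopping point.
    def best_f1_point(curve):
        """curve: list of (effort, recall, precision, f1) tuples."""
        return max(curve, key=lambda point: point[3])

    # best_f1_point(gain_curve(labels, total_relevant)) returns the
    # (effort, recall, precision, f1) tuple at which F1 was highest.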
The server for some of the collections will be available to participants via the Internet, subject to the execution of a usage agreement. For these collections, participants will run their own systems and access the automated assessor via the Internet ("Play-at-Home" participation). No prior experimentation or practice on these collections is permitted; all runs will be logged and reported in the TREC 2015 proceedings.
Participants must declare each run to be
either "automatic," meaning that no manual intervention
was used once the collection was downloaded, or "manual,"
meaning that manual intervention -- whether parameter tweaking
or full-scale document review -- was involved.
If multiple runs are conducted, every run must be independent;
under no circumstances may information learned from one run be
used in any other. Play-at-Home participants will be required to
complete a short questionnaire describing the nature and quantity
of manual effort involved in each run.
To preserve the confidentiality of sensitive information, the server for some of the collections will be available only within a firewalled platform with no Internet access. Participants wishing to evaluate their systems on these datasets must submit a fully automated solution, which the Track coordinators will execute as a virtual machine within a restricted environment ("Sandbox" participation).
The baseline model implementation ("BMI") supplied by the 2015 Total Recall Track is suitable for "Sandbox" as well as automatic "Play-at-Home" participation, and participants are free to modify it as they see fit, subject to the GNU General Public License, version 3 (GPLv3).
Participants may submit their own virtual machine, perhaps containing proprietary software. In this case, participants must warrant that they have the right to use the software in this way, and the Track coordinators will in turn warrant that the submission will be used only for the purpose of evaluation within the sandbox.