These guidelines expand on the Initial
Draft Guidelines of May 3, 2015. Questions concerning
the Total Recall Track may be directed to, and will be answered
in, the Total
Recall Discussion Group.
The At-Home task will be divided into three "tests" (as defined in the guidelines) named athome1, athome2, and athome3. Each of the At-Home tests has ten topics and a single dataset, for a total of 30 topics. The three datasets contain 290,000, 450,000, and 900,000 documents, respectively.
Participants may conduct one or more full experiments, each of
which consists of three runs -- one for each of the athome1, athome2,
and athome3 tests. Alternatively, groups with limited
resources may conduct limited experiments, each consisting of a
single run on the athome1 test.
Access to the At-Home datasets, topics, and relevance assessments will be through the Assessment Server, which has been available and will continue to be available throughout the At-Home task.
NOTE: the Assessment Server will be down on Sunday, June 28, in order to configure it with the At-Home tests.
Each participant may conduct up to six automatic full or limited
experiments, with each experiment applying a particular fully
automated method to the athome1 test and, in the case of a full
experiment, also to the athome2 and athome3 tests. Participants should
use a meaningful name (of their own choice) for each experiment,
and enter that name as the "RUNNAME" in the Baseline
Model Implementation ("BMI") configuration file, as the
":alias" parameter in the API, or as "Run
Alias" when using the manual Web interface.
NOTE: once an athome test run is created, its results will become part of the official TREC record. It is not possible to start over or to expunge a run.
Automatic experiments may interact with the assessment server either directly using its API, or using the code provided in the BMI.
Participants must certify that each automatic experiment was conducted entirely by software, without human intervention: the software must download the dataset and conduct the task end to end on its own.
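As a rough illustration of such an end-to-end loop, the Python sketch below submits batches of documents for a topic and learns from the returned assessments. The server address, endpoint name, and JSON fields are hypothetical placeholders, and "ranker" stands for whatever learning method a participant uses; the actual interface is defined by the assessment server's API documentation and the BMI code.

    # Minimal sketch of a fully automated run loop.  The URL, endpoint,
    # and JSON fields below are placeholders -- consult the assessment
    # server's API documentation (or the BMI source) for the real interface.
    import requests

    SERVER = "http://assessment-server.example/api"   # placeholder address
    ALIAS = "myGroup-baselineRun"                      # run alias / experiment name

    def run_topic(topic_id, ranker):
        """Iteratively submit documents and learn from returned assessments."""
        judged = {}                                    # docid -> relevance label
        while not ranker.done(judged):
            batch = ranker.next_batch(judged)          # e.g., top unjudged documents
            resp = requests.post(f"{SERVER}/judge",    # hypothetical endpoint
                                 json={"alias": ALIAS,
                                       "topic": topic_id,
                                       "docs": batch})
            judged.update(resp.json()["assessments"])  # hypothetical response field
            ranker.update(judged)                      # retrain on the new labels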
Each group may conduct one Manual At-Home experiment (whether or not they also conduct Automatic At-Home experiments). Participants conducting both manual and automatic experiments must ensure that the software to conduct their automatic experiments is frozen prior to creating any manual run.
Participants are required to track the nature and quantity of manual effort, and to submit this information before the end of the At-Home phase.
The coordinators envision that manual participants may engage in a variety of activities.
Participants are asked to report the general nature of these activities, to estimate the average number of hours spent per topic, and to report the number of documents reviewed per topic.
NOTE: manual participants, whether or not they manually review documents, may still avail themselves of assessments through the assessment server, using the TREC-supplied "Manual" interface, or using the API or BMI.
Each participant will be assigned an extended Group ID, which must be activated in order to conduct At-Home experiments. The Group ID will have the form GGG.XXXX, where GGG is the Group ID used for practice and XXXX is a randomly generated suffix.
To gain access to athome1, participants must sign the "TREC Total Recall Usage Agreement" and return a PDF of the signed agreement to the TREC Total Recall coordinators.
To gain access to athome2 and athome3, participants must submit the "TREC Dynamic Domain Usage Agreement" to NIST and forward the email confirming NIST's acceptance of that agreement to the Total Recall coordinators.
NOTE: Participants do not need to download
the Dynamic Domain datasets to participate in Total Recall; but
they do need to obtain permission, as some of the Total Recall
datasets are derived from the Dynamic Domain datasets. These
derivative datasets will be supplied automatically to authorized
participants by the Total Recall server.
For each experiment, participants will be required to respond to
a questionnaire.
Sandbox submissions will be run by the TREC Total Recall coordinators (or their delegates) on private datasets. One of the datasets that will be used consists of 400,000 email messages from the administration of a senior elected official, which have previously been classified according to statutory criteria by a professional archivist.
Other datasets and topics include test collections that are
available to, but cannot be disseminated by, the TREC Total Recall
coordinators.
Further details on Sandbox submission requirements will be
available prior to the Sandbox submission deadline of September 1.
Once participants have completed their experiments, there will be a facility for them to download a log of their submissions, as well as the official relevance assessments. Tools that compute various summary evaluation results will be provided. Participants may use this information to conduct unofficial experiments exploring "what if?" scenarios.
The TREC 2015 Total Recall Track will report a number of evaluation measures that reflect how completely the relevant documents are found (i.e., completeness), as a function of the number of documents submitted to the assessment server (i.e., effort).
Generally, these measures may be grouped into "Rank measures" and "Set measures." Rank measures reflect completeness for various effort values; for example, as a gain curve, or as a summary measure such as "effort to achieve 80% recall."
Set measures, on the other hand, reflect completeness and effort at a fixed level of effort, specified by the participant's system during the run. Three such fixed levels of effort will be used: "70% (estimated) recall," "80% (estimated) recall," and "best effort." Participating runs use the "call your shot" interface to indicate the points at which they estimate each of these levels has been reached.
Participants who do not use the "call your shot" interface will receive no score for the set-based measures, but will be scored according to the rank-based measures.
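As a rough sketch of how these measures relate a run's submission log to the official assessments, the following Python functions compute recall at a given effort, the effort needed to reach a target recall (a rank measure), and the completeness and effort at the point where a run "called its shot" (set measures). The function and variable names are illustrative only; official results will be computed by the tools provided by the track.

    # "submitted" is the list of document IDs in the order they were sent to
    # the assessment server for one topic; "relevant" is the set of relevant
    # document IDs from the official assessments; "shot" is the (1-based)
    # position at which the run called its shot.

    def recall_at_effort(submitted, relevant, effort):
        found = sum(1 for d in submitted[:effort] if d in relevant)
        return found / len(relevant)

    def effort_to_reach(submitted, relevant, target=0.8):
        found = 0
        for i, d in enumerate(submitted, start=1):
            if d in relevant:
                found += 1
            if found / len(relevant) >= target:
                return i                 # rank measure: effort to achieve target recall
        return None                      # target recall never reached

    def set_measures(submitted, relevant, shot):
        return {"recall": recall_at_effort(submitted, relevant, shot),
                "effort": shot}          # completeness and effort at the called shot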
A number of different "completeness" measures will be reported. The most obvious measure of completeness is total recall -- the fraction of relevant documents that have been submitted to the assessment server.
To mitigate known shortcomings of recall, facet-based recall will also be reported, for various facets. A facet is an identifiable subpopulation of documents; for example, highly relevant documents, documents reflecting a particular subtopic, documents of a particular type, documents from a particular subcollection, etc.
The objective is to achieve a high level of completeness on each facet, regardless of how the facets may be defined.
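A corresponding sketch of facet-based recall, assuming the facet definitions are supplied as sets of relevant document IDs (the actual facet definitions are determined by the coordinators):

    # "facets" maps a facet name to the set of relevant document IDs in that
    # facet; recall is computed separately over each facet.

    def facet_recall(submitted, facets, effort):
        seen = set(submitted[:effort])
        return {name: len(seen & docs) / len(docs)
                for name, docs in facets.items() if docs}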
Participants who conduct at least one experiment (At-Home automatic, At-Home manual, or Sandbox) are eligible to attend the TREC workshop in November, to have a paper included in the TREC workbook, and to have a paper included in the final TREC proceedings. Participants may present a poster at TREC, and may be invited to speak.