TREC 2006 Terabyte Track Guidelines


Timetable (Revised June 8)

     Documents available:                Now
     Named Page Topic Creation:          May 15-29
     Efficiency topics released:         June 8, 2006
     Efficiency results due at NIST:     June 20
     Adhoc topics released:              June 21
     Named page finding topics released: June 21
     Adhoc results due at NIST:          July 2
     Named page finding results due:     July 30
     Comparative efficiency runs due:    August 7 
     Conference notebook papers due:     late October
     TREC 2006 conference:               November 14-17

Major Changes for 2006

If you did not participate in the TREC 2005 track, you can skip this section, which summarizes the major changes from last year:
  1. We are strongly encouraging the submission of adhoc manual runs, as well as runs using pseudo-relevance feedback and other query expansion techniques. Our goal is to increase the diversity of the judging pools in order to create a more reusable test collection. The run contributing the most unique relevant documents to the judging pool will receive special recognition (and a prize).
  2. Topics for the named page finding task will be created by the task participants, with each group that plans to submit a named page finding run creating at least 12 topics.
  3. The experimental procedure for the efficiency task has been redefined to permit more realistic intra- and inter-system comparisons, and to generate separate measurements of latency and throughput. In order to compare systems across various hardware configurations, comparative runs using publicly available search engines are encouraged.

Overview

The primary goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. In addition, we are interested in efficiency and scalability issues, which can be studied more easily in the context of a larger collection. Again this year, we are using a 426GB collection of Web data from the gov domain for all tasks. While this collection is less than a full terabyte in size, it is considerably larger than the collections used in previous TREC tracks. In future years, we hope to expand the collection using data from other sources.

Again this year, the main track task is classic adhoc retrieval. All participants are expected to submit at least one run for this task. In addition, there are two optional tasks: a named page finding task and an efficiency task.

Collection

All tasks in this year's track will use a collection of Web data crawled from Web sites in the gov domain during early 2004. This collection ("GOV2") contains a large proportion of the crawlable pages in gov, including HTML and text, plus the extracted text of PDF, Word, and PostScript files. The collection is 426GB in size and contains 25 million documents.

For TREC 2004, the collection was distributed by CSIRO in Australia. From TREC 2005 forward, this collection is available from the University of Glasgow. The collection has not changed in any way. If you participated in the track during 2004 or 2005, and obtained a copy of the GOV2 collection from CSIRO or the University of Glasgow, you do not need to obtain a new copy of the collection this year.

Topics and Queries

New topics for all three tasks will be released by NIST according to the timetable above. For the main adhoc task, queries may be created automatically or manually from these topic statements. For named page finding and efficiency tasks, queries must be created automatically. Automatic methods are those in which there is no human intervention at any stage, and manual methods are everything else. Topics from previous years (along with task guidelines and relevance judgments) are available from the TREC data archive and may be used for training.

Adhoc Task

An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top documents for that topic (10,000 for this task). NIST will create and assess 50 new topics for the task.

For most runs, you may use any or all of the topic fields when creating queries from the topic statements. For this task only, you may submit both automatic and manual runs. Each group submitting any automatic run must submit an automatic run that uses just the title field of the topic statement. Manual runs are strongly encouraged, since these runs often add relevant documents to the evaluation pool that are not found by automatic systems using current technology. To encourage the submission of manual runs, as well as runs using pseudo-relevance feedback and other query expansion techniques, the group submitting the run that contributes the most unique relevant documents to the judging pool will be awarded a prize recognizing this achievement.

An experimental run consists of the top 10,000 documents for each topic. Groups may submit up to five runs for the adhoc task. At least one automatic and one manual run from each group will be judged by NIST assessors; NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily. The judgments will be on a three-way scale of "not relevant", "relevant", and "highly relevant".

The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. You may return fewer than 10,000 documents for a topic, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 10,000 documents per topic.
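
As a rough illustration of this point, the sketch below computes average precision for a single invented topic; the document numbers and run contents are made up, and the function is an informal stand-in for the official trec_eval computation. A relevant document that is never retrieved counts fully against the run, so returning documents deeper in the ranking can only leave the score unchanged or raise it.

     def average_precision(ranking, relevant):
         # Uninterpolated average precision: relevant documents that are
         # never retrieved contribute nothing to the sum, so the measure
         # behaves as if they had been ranked below everything returned.
         hits, precision_sum = 0, 0.0
         for rank, doc in enumerate(ranking, start=1):
             if doc in relevant:
                 hits += 1
                 precision_sum += hits / rank
         return precision_sum / len(relevant) if relevant else 0.0

     relevant = {"GX001-23-0000000", "GX002-17-1111111"}    # invented docnos
     short_run = ["GX001-23-0000000", "GX099-00-2222222"]   # shallow run: only 2 documents
     deeper_run = short_run + ["GX002-17-1111111"]          # deeper run finds the 2nd relevant doc

     print(average_precision(short_run, relevant))    # 0.5
     print(average_precision(deeper_run, relevant))   # ~0.83; extra depth can only help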

In addition to the top 10,000 documents, we will be collecting information about each system and each run, including hardware characteristics and performance measurements such as total query processing time. Details are given in a separate section below. Be sure to record the required information when you generate your experimental runs, since it will be requested on the submission form.

For query processing time, report the time to return the top 20 documents, not the time to return the top 10,000. It is acceptable to execute your system twice for each query, once to generate the top 10,000 documents and once to measure the execution time for the top 20, provided that the top 20 results are the same in both cases.
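
One possible way to organize this, sketched below, is to produce the 10,000-document ranking once for the submission file and then time a separate top-20 execution, checking that the two agree on the first 20 documents. The search(query, k) function is a hypothetical hook into your own engine, assumed to return a ranked list of (docno, score) pairs.

     import time

     def run_topic(query, search):
         full_ranking = search(query, k=10000)    # ranking submitted to NIST

         start = time.perf_counter()
         top20 = search(query, k=20)              # separate, timed execution
         elapsed = time.perf_counter() - start    # report this as query processing time

         # The guidelines require the top 20 of both executions to be identical.
         assert [d for d, _ in top20] == [d for d, _ in full_ranking[:20]]
         return full_ranking, elapsed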

Named Page Finding Task

Users sometimes search for a page by name. In such cases, an effective search system will return that page at or near rank one. In many cases there is only one correct answer. In other cases, any document from a small set of "near duplicates" is correct.

Systems will be compared on the basis of the rank of the first correct answer. Reported measures will include mean reciprocal rank of first correct answer and success rate at N for N = 1, 5 and 10. Success rate is defined as the percentage of cases in which the correct answer or equivalent URL occurred in the first N documents.
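
For reference, the sketch below computes both measures from the rank of the first correct answer for each topic (None when no correct answer was returned); it is an informal illustration rather than the official evaluation code, and the example ranks are invented.

     def named_page_scores(first_correct_ranks, cutoffs=(1, 5, 10)):
         # first_correct_ranks: one entry per topic, giving the rank of the
         # first correct (or near-duplicate) answer, or None if none was found.
         n = len(first_correct_ranks)
         mrr = sum(1.0 / r for r in first_correct_ranks if r) / n
         success = {k: 100.0 * sum(1 for r in first_correct_ranks if r and r <= k) / n
                    for k in cutoffs}
         return mrr, success

     # Example: four topics, answered at ranks 1, 3, and 12, and one not answered at all.
     mrr, success = named_page_scores([1, 3, 12, None])
     print(mrr)       # (1/1 + 1/3 + 1/12 + 0) / 4 = 0.354...
     print(success)   # {1: 25.0, 5: 50.0, 10: 50.0}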

Roughly 150 new topics will be created for this task by the task participants. Guidelines for topic creation are given on a separate page.

Groups may submit up to four runs. A run consists of the top 1000 documents for each topic. For each run, groups should record and report the system characteristics described below. As with the adhoc task, query processing time should be reported for the top 20 documents. No manual or interactive query modification is permitted in this task.

Efficiency Task

The efficiency task extends the adhoc and named page finding tasks. Its purpose is to enable the community to evaluate the efficiency of retrieval systems and to compare experimental performance results with those reported by other groups, even if the experiments were conducted using different hardware configurations.

Two weeks before the new topics for the adhoc task become available, we will release a set of 100,000 efficiency topics. This topic set will be a mix of search queries mined from the query logs of a web search engine and the title fields of TREC topics, including adhoc and named page finding topics from both this year and previous years. From the system's point of view, there will be no distinction between the individual topic types. Queries must be created automatically from these topics; manual runs are not permitted for this task.

The efficiency topic set is distributed in 4 separate files, representing 4 independent query streams. Queries within the same stream must be processed sequentially in the order in which they appear in the respective topic file. Processing of each query in a stream must be completed before processing of the next query is started. Queries from different query streams may be processed concurrently or interleaved in any arbitrary order. The existence of independent query streams will allow systems to take better advantage of parallelism and I/O scheduling.

Each participating group will run their system on the entire topic set (all four streams), reporting the top 20 documents for each topic, the average query processing time per topic, L (query processing latency), and the total time T between reading the first topic and writing the last result set (used to calculate query throughput). The total time should be reported without taking into account system startup times. Query processing latency is measured for each topic as the time between reading the first byte of the topic from the input file and writing the last byte of its search results into the output file. Thus, if the 4 query streams are processed sequentially, T = L * 100,000.
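
The sketch below shows one way these rules could be followed: each of the four streams runs in its own thread, queries within a stream are handled strictly in order, per-query latencies are accumulated, and T is taken from the start of the first stream to the completion of the last. The process_query function and the output file naming are placeholders for your own system.

     import time
     from threading import Thread

     def run_stream(stream_file, out_file, latencies, process_query):
         # Queries within a single stream are processed strictly in file order.
         with open(stream_file) as topics, open(out_file, "w") as out:
             for topic in topics:
                 start = time.perf_counter()
                 out.write(process_query(topic))           # placeholder: top-20 result lines
                 out.flush()
                 latencies.append(time.perf_counter() - start)

     def run_all(stream_files, process_query):
         latencies = []
         threads = [Thread(target=run_stream, args=(f, f + ".results", latencies, process_query))
                    for f in stream_files]                  # the 4 streams may run concurrently
         start = time.perf_counter()
         for t in threads:
             t.start()
         for t in threads:
             t.join()
         T = time.perf_counter() - start                    # total time, for throughput
         L = sum(latencies) / len(latencies)                # average per-query latency
         return L, T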

Before the start of the runs, all caches and memory must be in a state which is independent of the queries in the topic set. The pre-loading of information into caches and memory (e.g., by processing "warm-up" queries) is permitted, provided that this procedure was fully determined prior to the download of the query set from NIST. When executing multiple runs, the system should be reset to a state which is independent of the query set. For many systems, this may be achieved by a reboot followed by the predetermined "warm-up" procedure.

Each group may submit up to four runs (plus an additional run for each of the baseline systems, as discussed below). The search results for at least one of these runs will be judged by NIST assessors. For each run, the top 20 search results and the basic performance figures outlined above will be reported. Search engine effectiveness will be evaluated using these top 20 results. Precision at 20 documents will be the primary measure for the adhoc topics; mean reciprocal rank (MRR) for the named page finding topics.
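
As an informal illustration (not the official evaluation code), precision at 20 for a single topic could be computed as below from the judged top 20; the argument names are hypothetical.

     def precision_at_20(top20_docnos, relevant_docnos):
         # Fraction of the first 20 retrieved documents judged relevant
         # ("highly relevant" counts as relevant here).
         return sum(1 for d in top20_docnos[:20] if d in relevant_docnos) / 20.0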

Participants are strongly encouraged to submit at least one run conducted in a single-processor configuration, with the four query streams processed sequentially (i.e., without any parallelism) in order to minimize latency. Participants are also strongly encouraged to produce additional runs using one (or more, if time permits) of three publicly available information retrieval systems (Indri, Wumpus, Zettair). These systems are available for download along with a detailed description of how to build and run the respective system. For each such comparative run, only the total execution time (once for index construction, once for query processing) needs to be reported. Comparative runs are due four weeks after the general deadline for efficiency runs.

Submissions

For all tasks, the submission form requires each group to report the following details about their hardware configuration and system performance: 1) percentage of document collection indexed, 2) indexing time in minutes, 3) total query processing time (top 20 documents), 4) number of processors, 5) total RAM, 6) size of on-disk file structures, 7) hardware cost, and 8) year of purchase.

For the number of processors, report the total number of CPUs in the system. For example, if your system is a cluster of eight dual-processor machines, you would report 16. For the hardware cost, provide an estimate in US dollars of the cost at the time of purchase.

Some groups may subset the collection before indexing, removing selected pages to reduce its size. The submission form asks for the fraction of pages indexed. If you did not subset the collection before indexing, report 100%.

The submission form will also collect basic query processing information for each run, including the topic fields from which the query was derived, as well as the use of link information and document structure.

Submission Formats

All runs must be compressed (gzip or bzip2).

For all tasks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

       630 Q0 ZF08-175-870  1 4238 prise1
       630 Q0 ZF08-306-044  2 4223 prise1
       630 Q0 ZF09-477-757  3 4207 prise1
       630 Q0 ZF08-312-422  4 4194 prise1
       630 Q0 ZF08-013-262  5 4189 prise1
          etc.

where:

  • the first column is the topic number.
  • the second column is the query number within that topic. This is currently unused and should always be Q0.
  • the third column is the official document number of the retrieved document and is the number found in the "docno" field of the document.
  • the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking.
  • the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags.
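
The sketch below is a minimal consistency check for a run file in this format; the file name is hypothetical and the checks only mirror the constraints described above, so it is not an official validation tool.

     import re, sys

     def check_run_file(path):
         tags = set()
         last_score = {}                           # most recent score seen for each topic
         with open(path) as run:
             for n, line in enumerate(run, start=1):
                 cols = line.split()
                 if len(cols) != 6:
                     sys.exit(f"line {n}: expected exactly 6 columns, got {len(cols)}")
                 topic, q0, docno, rank, score, tag = cols
                 if q0 != "Q0":
                     sys.exit(f"line {n}: second column must be Q0")
                 if not re.fullmatch(r"[A-Za-z0-9]{1,12}", tag):
                     sys.exit(f"line {n}: run tag must be 1-12 letters/digits, no punctuation")
                 tags.add(tag)
                 score = float(score)
                 if topic in last_score and score > last_score[topic]:
                     sys.exit(f"line {n}: scores for topic {topic} must be non-increasing")
                 last_score[topic] = score
         if len(tags) != 1:
             sys.exit("a run file should use a single run tag throughout")

     check_run_file("terabyte06.run")              # hypothetical (uncompressed) file name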


Last updated: 08-Jun-2006
Date created: 29-Apr-2006
claclarke@plg.uwaterloo.ca