TREC 2012 Web Track Guidelines (DRAFT)

Charles Clarke, University of Waterloo
Nick Craswell, Microsoft Research
Ellen Voorhees (NIST Contact)

Welcome to the TREC 2012 Web Track. Our goal is to explore and evaluate Web retrieval technologies over the billion-page ClueWeb09 Dataset. This year the track will continue the adhoc retrieval and diversity tasks from 2009, 2010, and 2011.

We assume you arrived at this page because you're participating in this year's TREC conference. If not, you should start at the TREC main page.

If you're new to the TREC Web Track, you may want to start by reading the track overview papers from TREC 2009, TREC 2010, and TREC 2011. Take note of the size of the collection, which is roughly 25TB uncompressed. We suggest that you obtain the collection and start working with it as soon as possible. It's a lot of data. To help you out a little, the Waterloo spam ranking from 2010 remains available.

If you participated last year, little has changed this year. We had hoped to have a new collection available, but it won't be finished in time for this year's track. The main change from last year is that we will be sharing topics with the NTCIR diversity task, which we hope will provide new insights into evaluation methods for result diversification. Runs may still be submitted over both the Category A and B collections. All judged runs will be fully judged according to both the adhoc and diversity criteria to some minimum depth k ≥ 10.

If you're planning to participate in the track, you should be on the track mailing list. If you're not on the list, send a mail message to listproc (at) nist (dot) gov such that the body consists of the line "subscribe trec-web FirstName LastName".


Exact dates will be announced soon, but the rough schedule is:
  • Corpus available: now
  • Topics available: June 13
  • Submissions due: August 9
  • Results available: September 30
  • TREC 2012 conference: November 6-9, Gaithersburg, Maryland


Web Tracks at TREC have explored specific aspects of Web retrieval, including named page finding, topic distillation, and traditional adhoc retrieval. Starting in 2009 we introduced a diversity task that combines aspects of all these older tasks. The goal of this diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. We continue both the diversity task and a traditional adhoc task for TREC 2012.

The adhoc and diversity tasks share topics, which will be developed with the assistance of information extracted from the logs of a commercial Web search engine. Topic creation and judging will attempt to reflect a mix of genuine user requirements for the topic. See below for example topics.

Document Collection

The track will again use the ClueWeb09 dataset as its document collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in multiple languages. The dataset was crawled from the Web during January and February 2009.

Further information regarding the collection can be found on the associated Website. Since it can take several weeks to obtain the dataset, we urge you to start this process as soon as you can. The collection will be shipped to you on four 1.5TB hard disks at an expected cost of US$790 plus shipping charges.

If you are unable to work with the full dataset, we will accept runs over the smaller ClueWeb09 "Category B" dataset, but we strongly encourage you to use the full "Category A" dataset if you can. The Category B dataset represents a subset of about 50 million English-language pages. The Category B dataset can be ordered through the ClueWeb09 Website. It will be shipped to you on a single 1.0TB hard disk at an expected cost of US$240 plus shipping charges.

Adhoc Task

An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. The goal of the task is to return a ranking of the documents in the collection in order of decreasing probability of relevance. The probability of relevance of a document is considered independently of other documents that appear before it in the result list. For each topic, participants submit a ranking of the top 10,000 documents for that topic.

NIST will create and assess 50 new topics for the task, but NIST will not release the full topics to the participants until after runs are submitted. Instead, the initial release of the topics will consist of 50 queries (the topic "titles" in the traditional TREC jargon). No other information regarding the topics will be provided as part of the initial release. An experimental run consists of the top 10,000 documents for each of these topics.

The process of generating an experimental run may be either "manual" or "automatic". For automatic runs, the process of executing the queries over the documents and generating the experimental run should be entirely automated. There should be no human intervention at any stage, including modifications to your retrieval system motivated by an inspection of the queries. For automatic runs, you should not materially modify your retrieval system between the time you download the queries and the time you submit your runs. Runs not satisfying these criteria are considered to be manual runs, even if the human intervention is very minor, e.g., a single step in a long process.

At least one run from each group will be judged by NIST assessors. Each document will be judged on a six-point scale, as follows:

  • This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.
  • This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
  • The content of this page provides substantial information on the topic.
  • The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
  • The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
  • This page does not appear to be useful for any reasonable purpose; it may be spam or junk.

All topics are expressed in English. Non-English documents will be judged non-relevant, even if the assessor understands the language of the document and the document would be relevant in that language. If the location of the user matters, the assessor will assume that the user is located in Gaithersburg, Maryland.

Again this year the primary evaluation measure will be expected reciprocal rank (ERR) as defined by Chapelle et al. (CIKM 2009). In addition to ERR, we will compute and report a range of standard measures, including MAP, precision@10 and NDCG@10.
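As a rough illustration of how ERR rewards placing highly relevant pages early, the following Python sketch follows the cascade formulation of Chapelle et al. The mapping of judgments to integer grades, the `max_grade` value, and the cutoff `depth` are assumptions made for this example; the official evaluation software may differ in these details.

```python
def expected_reciprocal_rank(grades, max_grade, depth=20):
    """Compute ERR at `depth` for a ranked list of integer relevance grades.

    grades:    relevance grade of the document at each rank (0 = not relevant).
    max_grade: highest grade on the scale (an assumption of this sketch).
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:depth], start=1):
        # Probability that a document with this grade satisfies the user.
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err


if __name__ == "__main__":
    # The same two documents in different orders: the better page first scores higher.
    print(expected_reciprocal_rank([4, 1, 0], max_grade=4))
    print(expected_reciprocal_rank([0, 1, 4], max_grade=4))
```

Ranks that are missing from a run contribute nothing to the sum, which matches treating them as not relevant.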

You may submit up to three runs for the adhoc task; at least one will be judged. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.

The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. While many of the evaluation measures used in this track consider only the top 10-20 documents, some methods for estimating MAP sample at deeper levels, and we request that you return the top 10,000 to aid in this process. You may return fewer than 10,000 documents for a topic. However, you cannot hurt your score, and could conceivably improve it, by returning 10,000 documents per topic. All the evaluation measures used in the track count empty ranks as not relevant (Non).

Diversity Task

The diversity task is similar to the adhoc retrieval task, but differs in its judging process and evaluation measures. The goal of the diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. For this task, the probability of relevance of a document is conditioned on the documents that appear before it in the result list.

For the purposes of the diversity task, each topic will be structured as a representative set of subtopics, each related to a different user need. Examples are provided below. Documents will be judged with respect to the subtopics. For each subtopic, NIST assessors will make a binary judgment as to whether or not the document satisfies the information need associated with the subtopic.

Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine, and will be roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.

Again this year, the primary evaluation measure for the diversity task will be intent-aware expected reciprocal rank (ERR-IA). Developing and validating metrics for diversity tasks continues to be a goal of the track, and we will report a number of other evaluation measures that have been proposed over the past several years. Clarke et al. (WSDM 2011) provide a summary and analysis of many of these evaluation measures, including ERR-IA.
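To give a feel for how an intent-aware measure favours coverage of several subtopics, here is a minimal Python sketch that computes ERR separately for each subtopic from binary judgments and averages the results. The uniform subtopic weights, the satisfaction probability `p_stop = 0.5` for a relevant page, and the cutoff `depth` are all assumptions of this sketch; the official evaluation software may weight and normalize differently.

```python
def err_ia(ranked_subtopics, num_subtopics, depth=20, p_stop=0.5):
    """Intent-aware ERR from binary subtopic judgments (illustrative only).

    ranked_subtopics: for each rank, the set of subtopic numbers (1-based)
                      that the document at that rank satisfies.
    num_subtopics:    number of subtopics defined for the topic.
    p_stop:           assumed probability that a relevant page satisfies the user.
    """
    total = 0.0
    for subtopic in range(1, num_subtopics + 1):
        p_reach = 1.0  # probability a user with this intent reaches the rank
        score = 0.0
        for rank, satisfied in enumerate(ranked_subtopics[:depth], start=1):
            if subtopic in satisfied:
                score += p_reach * p_stop / rank
                p_reach *= 1.0 - p_stop
        total += score / num_subtopics  # uniform weight per subtopic (assumed)
    return total


if __name__ == "__main__":
    diverse = [{1}, {2}]    # two pages covering different subtopics
    redundant = [{1}, {1}]  # two pages covering the same subtopic
    print(err_ia(diverse, num_subtopics=2))
    print(err_ia(redundant, num_subtopics=2))
```

Under these assumptions the list covering two subtopics scores higher than the list that repeats one, which is the behaviour the diversity task is meant to reward.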

In all other respects, the diversity task is identical to the adhoc task. The same 50 topics will be used. The submission format is the same. The top 10,000 documents should be submitted. You may submit up to three runs, at least one of which will be judged.

Topic Structure

The topic structure will be similar to that used for the TREC 2009 topics. The topics below provide examples.

    <topic number="6" type="ambiguous">
      <query>kcs</query>
      <description>Find information on the Kansas City Southern railroad.
      </description>
      <subtopic number="1" type="nav">
        Find the homepage for the Kansas City Southern railroad.
      </subtopic>
      <subtopic number="2" type="inf">
        I'm looking for a job with the Kansas City Southern railroad.
      </subtopic>
      <subtopic number="3" type="nav">
        Find the homepage for Kanawha County Schools in West Virginia.
      </subtopic>
      <subtopic number="4" type="nav">
        Find the homepage for the Knox County School system in Tennessee.
      </subtopic>
      <subtopic number="5" type="inf">
        Find information on KCS Energy, Inc., and their merger with
        Petrohawk Energy Corporation.
      </subtopic>
    </topic>

    <topic number="16" type="faceted">
      <query>arizona game and fish</query>
      <description>I'm looking for information about fishing and hunting
      in Arizona.
      </description>
      <subtopic number="1" type="nav">
        Take me to the Arizona Game and Fish Department homepage.
      </subtopic>
      <subtopic number="2" type="inf">
        What are the regulations for hunting and fishing in Arizona?
      </subtopic>
      <subtopic number="3" type="nav">
        I'm looking for the Arizona Fishing Report site.
      </subtopic>
      <subtopic number="4" type="inf">
        I'd like to find guides and outfitters for hunting trips in Arizona.
      </subtopic>
    </topic>

Initial topic release will include only the query field.

As shown in these examples, topics are categorized as either "ambiguous" or "faceted". Ambiguous queries are those that have multiple distinct interpretations. We assume that a user interested in one interpretation would not be interested in the others. On the other hand, facets reflect underspecified queries, with different aspects covered by the subtopics. We assume that a user interested in one aspect may still be interested in others.

Each subtopic is categorized as being either navigational ("nav") or informational ("inf"). A navigational subtopic usually has only a small number of relevant pages (often one). For these subtopics, we assume the user is seeking a page with a specific URL, such as an organization's homepage. On the other hand, an informational subtopic may have a large number of relevant pages. For these subtopics, we assume the user is seeking information without regard to its source, provided that the source is reliable.

For the adhoc task, relevance is judged on the basis of the description field. For the diversity task, a document may not be relevant to any subtopic, even if it is relevant to the overall topic. The set of subtopics is intended to be representative, not exhaustive. We expect each topic to contain 4-10 subtopics.
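Once the full topics are released, they can be loaded with a few lines of Python. The sketch below assumes the topic file is well-formed XML with closing tags and a single enclosing root element; the root name `topics` and the function name `parse_topics` are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# A small well-formed sample; the <topics> wrapper is an assumption.
SAMPLE = """
<topics>
  <topic number="16" type="faceted">
    <query>arizona game and fish</query>
    <description>I'm looking for information about fishing and hunting
    in Arizona.</description>
    <subtopic number="1" type="nav">
      Take me to the Arizona Game and Fish Department homepage.
    </subtopic>
    <subtopic number="2" type="inf">
      What are the regulations for hunting and fishing in Arizona?
    </subtopic>
  </topic>
</topics>
"""


def squeeze(text):
    """Collapse line breaks and runs of spaces into single spaces."""
    return " ".join((text or "").split())


def parse_topics(xml_text):
    """Return one dictionary per <topic> element found in the XML text."""
    root = ET.fromstring(xml_text.strip())
    topics = []
    for topic in root.iter("topic"):
        topics.append({
            "number": int(topic.get("number")),
            "type": topic.get("type"),  # "ambiguous" or "faceted"
            "query": squeeze(topic.findtext("query")),
            "description": squeeze(topic.findtext("description")),
            "subtopics": [
                {
                    "number": int(sub.get("number")),
                    "type": sub.get("type"),  # "nav" or "inf"
                    "text": squeeze(sub.text),
                }
                for sub in topic.findall("subtopic")
            ],
        })
    return topics


if __name__ == "__main__":
    for topic in parse_topics(SAMPLE):
        print(topic["number"], topic["type"], topic["query"])
        for sub in topic["subtopics"]:
            print("  ", sub["number"], sub["type"], sub["text"])
```

With the initial release, only the query text would be available, so the description and subtopic list would be empty for every topic.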

Submission Format for Adhoc and Diversity Tasks

All adhoc and diversity task runs must be compressed (gzip or bzip2).

For both tasks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

    5 Q0 clueweb09-enwp02-06-01125 1 32.38 example2012
    5 Q0 clueweb09-en0011-25-31331 2 29.73 example2012
    5 Q0 clueweb09-en0006-97-08104 3 21.93 example2012
    5 Q0 clueweb09-en0009-82-23589 4 21.34 example2012
    5 Q0 clueweb09-en0001-51-20258 5 21.06 example2012
    5 Q0 clueweb09-en0002-99-12860 6 13.00 example2012
    5 Q0 clueweb09-en0003-08-08637 7 12.87 example2012
    5 Q0 clueweb09-en0004-79-18096 8 11.13 example2012
    5 Q0 clueweb09-en0008-90-04729 9 10.72 example2012


  • the first column is the topic number.
  • the second column is currently unused and should always be "Q0".
  • the third column is the official document identifier of the retrieved document. For documents in the ClueWeb09 collection this identifier is the value found in the "WARC-TREC-ID" field of the document's WARC header.
  • the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. Scores must be in descending (non-increasing) order. The evaluation program ranks documents from these scores, not from your ranks. If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  • the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since we often compare across years (for graphs and such) and having the same name show up for both years is confusing. Also, run tags must contain 12 or fewer letters and numbers, with no punctuation, to facilitate labeling graphs with the tags.
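The column rules above can be checked before submission. The following Python sketch formats one topic's results, checks a run against the six-column rules, and writes a gzip-compressed file. The function names (`format_run`, `check_run_lines`, `write_run`) are hypothetical helpers written for this illustration, not part of any NIST tool, and the checks cover only the rules stated in this section.

```python
import gzip
import re

# Run tags: 12 or fewer letters and numbers, no punctuation.
RUN_TAG = re.compile(r"[A-Za-z0-9]{1,12}")


def format_run(topic, scored_docs, run_tag, max_docs=10000):
    """Build run lines for one topic from (doc_id, score) pairs."""
    if not RUN_TAG.fullmatch(run_tag):
        raise ValueError("run tag must be 1-12 letters and numbers")
    # Sort by score so that ranks and scores agree.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic} Q0 {doc_id} {rank} {score} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked[:max_docs], start=1)
    ]


def check_run_lines(lines):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    last_score = {}  # most recent score seen for each topic
    for number, line in enumerate(lines, start=1):
        columns = line.split()
        if len(columns) != 6:
            problems.append(f"line {number}: expected 6 columns")
            continue
        topic, q0, _doc_id, _rank, score, tag = columns
        if q0 != "Q0":
            problems.append(f"line {number}: second column must be Q0")
        if not RUN_TAG.fullmatch(tag):
            problems.append(f"line {number}: invalid run tag")
        try:
            value = float(score)
        except ValueError:
            problems.append(f"line {number}: score is not a number")
            continue
        if topic in last_score and value > last_score[topic]:
            problems.append(f"line {number}: score increases within topic")
        last_score[topic] = value
    return problems


def write_run(path, lines):
    """Write the run as gzip-compressed ASCII text."""
    with gzip.open(path, "wt", encoding="ascii") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    docs = [("clueweb09-en0011-25-31331", 29.73),
            ("clueweb09-enwp02-06-01125", 32.38)]
    lines = format_run(5, docs, "example2012")
    print(check_run_lines(lines))
    write_run("example2012.gz", lines)
```

Sorting by score inside `format_run` keeps the rank column consistent with the scores, which matters because the evaluation program ranks documents from the scores rather than from the submitted ranks.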

Last updated: 30-May-2012
Date created: 17-Apr-2012