TREC 2009 Web Track Guidelines

Nick Craswell, Microsoft Research
Charles Clarke, University of Waterloo
Ian Soboroff (NIST Contact)

Welcome to the TREC 2009 Web Track. Our goal is to explore and evaluate Web retrieval technologies over the new billion-page ClueWeb09 Dataset. The track will focus on a new diversity task, but will also include a traditional adhoc retrieval task.

We assume you arrived at this page because you're participating in this year's TREC conference. If not, you should start at the TREC main page.

If you're new to TREC, you may want to read the overview papers associated with the Terabyte Retrieval Track, which ran from 2004-2006, and the older Web Track, which ran from 1999-2003.

If you're a previous TREC participant, most of these guidelines may seem familiar. Only the diversity task is entirely new. Also, take note of the size of the collection. We suggest that you obtain the collection and start working with it as soon as possible. It's a lot of data.

If you're planning to participate in the track, you should be on the track mailing list. If you're not on the list, send a mail message to listproc (at) nist (dot) gov such that the body consists of the line "subscribe trec-web FirstName LastName".

Timetable

Corpus available: now
Topics available: June 15th (linked from the TREC Tracks Homepage)
Submissions due: August 19th before midnight EDT (submission form now available)
Results available: October 13
Notebook paper due: October 26
TREC 2009 conference: Nov 17-20, Gaithersburg, MD, USA

Goals

Previous Web tracks have explored specific aspects of Web retrieval, including named page finding, topic distillation, and traditional adhoc retrieval. This year, we are attempting a new task that combines aspects of all these older tasks. The goal of this new diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list.

For example, given the topic "windows", a system might return the Windows update page first, followed by the Microsoft home page, and then a news article discussing the release of Windows 7. Mixed in with these results might be pages providing product information on windows and doors for homes and businesses.

Topics for this task will be developed from information extracted from the logs of a commercial Web search engine. Topic creation and judging will attempt to reflect a mix of genuine user requirements for the topic.

Document Collection

The track will use the new ClueWeb09 dataset as its document collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in multiple languages. The dataset was crawled from the Web during January and February 2009.

Further information regarding the collection can be found on the associated Website. Since it can take several weeks to obtain the dataset, we urge you to start this process as soon as you can. The collection will be shipped to you on four 1.5TB hard disks at an expected cost of US$790 plus shipping charges.

If you are unable to work with the full dataset, we will accept runs over the smaller ClueWeb09 "Category B" dataset. This dataset represents a subset of about 50 million English-language pages. The Category B dataset can be ordered through the ClueWeb09 Website. It will be shipped to you on a single 1.0TB hard disk at an expected cost of US$240 plus shipping charges.

Adhoc Task

An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. The goal of the task is to return a ranking of the documents in the collection in order of decreasing probability of relevance. The probability of relevance of a document is considered independently of other documents that appear before it in the result list. For each topic, participants submit a ranking of the top 1,000 documents for that topic.

NIST will create and assess 50 new topics for the task. Unlike previous years, NIST will not release the full topics to the participants until after runs are submitted. Instead, the initial release of the topics will consist of 50 queries (the topic "titles" in the usual TREC jargon). No other information regarding the topics will be provided as part of the initial release.

An experimental run consists of the top 1,000 documents for each topic query. The process of executing the queries over the documents and generating the experimental runs should be entirely automatic. There should be no human intervention at any stage, including modifications to your retrieval system motivated by an inspection of the queries. You should not materially modify your retrieval system between the time you download the queries and the time you submit your runs.

At least one run from each group will be judged by NIST assessors. Each document will be judged on a three-point scale as being "relevant", "highly relevant" or "not relevant" to the topic associated with the query. All topics are expressed in English. Non-English documents will be judged non-relevant, even if the assessor understands the language of the document and the document would be relevant in that language. As with previous TREC adhoc tasks, the primary evaluation measure will be mean average precision (MAP), as estimated by the statMAP procedure (also used by the Million Query Track). In addition to MAP, we will compute and report a range of standard measures.

You may submit up to three runs for the adhoc task. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.

The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. You may return fewer than 1,000 documents for a topic, although all the evaluation measures used in the track count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1,000 documents per topic.
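The effect of returning fewer documents can be illustrated with a minimal sketch of plain average precision for a single topic. This is not the statMAP estimator used by the track, which works from sampled judgments; the function name and the toy document identifiers are assumptions made for illustration only.

```python
def average_precision(ranking, relevant):
    """Plain (non-estimated) average precision for one topic.

    ranking:  list of document ids in rank order, best first.
    relevant: set of document ids judged relevant for the topic.
    Ranks that are not filled contribute nothing, which is the same
    as treating them as not relevant.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy judgments and two runs for the same topic (hypothetical ids).
relevant = {"d1", "d4", "d7"}
short_run = ["d1", "d2"]
padded_run = ["d1", "d2", "d4", "d9"]

ap_short = average_precision(short_run, relevant)
ap_padded = average_precision(padded_run, relevant)
```

In this toy case the longer run picks up one more relevant document and scores higher; the extra non-relevant document at the end costs nothing.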

Diversity Task

The diversity task is similar to the adhoc retrieval task, but differs in its judging process and evaluation measures. The goal of the diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. For this task, the probability of relevance of a document is conditioned on the documents that appear before it in the result list.

For the purposes of the diversity task, each topic will be structured as a representative set of subtopics, each related to a different user need. An example is provided below. Documents will be judged with respect to the subtopics. For each subtopic, NIST assessors will make a binary judgment as to whether or not the document satisfies the information need associated with the subtopic.

Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine, and will be roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.

Since developing and validating metrics for this type of task is a goal of the track, we will explore a number of evaluation measures. These measures will include the α-nDCG measure proposed by Clarke et al. (SIGIR 2008) and intent-aware MAP (MAP-IA) proposed by Agrawal et al. (WSDM 2009). Those papers should be consulted for more information.
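As a rough guide to how a measure of this kind rewards coverage and penalizes redundancy, here is a minimal sketch in the spirit of α-nDCG, using binary subtopic judgments. It is not the official evaluation code: the function names, the default α of 0.5, the evaluation depth of 20, and the greedy approximation of the ideal ranking are all assumptions made for illustration.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5, depth=20):
    """Discounted gain where repeated coverage of a subtopic earns less.

    judgments: dict mapping doc id -> set of subtopic numbers it satisfies.
    """
    seen = {}  # subtopic -> number of earlier documents that covered it
    score = 0.0
    for rank, doc_id in enumerate(ranking[:depth], start=1):
        gain = 0.0
        for sub in judgments.get(doc_id, ()):
            gain += (1.0 - alpha) ** seen.get(sub, 0)
            seen[sub] = seen.get(sub, 0) + 1
        score += gain / math.log2(1 + rank)
    return score


def greedy_ideal(judgments, alpha=0.5, depth=20):
    """Greedy approximation of the ideal ordering (assumed here because
    finding the exact ideal ordering is computationally hard)."""
    remaining = set(judgments)
    seen = {}
    ideal = []
    while remaining and len(ideal) < depth:
        def gain(doc_id):
            return sum((1.0 - alpha) ** seen.get(s, 0)
                       for s in judgments[doc_id])
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        ideal.append(best)
        remaining.remove(best)
        for sub in judgments[best]:
            seen[sub] = seen.get(sub, 0) + 1
    return ideal


def alpha_ndcg(ranking, judgments, alpha=0.5, depth=20):
    ideal = alpha_dcg(greedy_ideal(judgments, alpha, depth),
                      judgments, alpha, depth)
    if ideal == 0.0:
        return 0.0
    return alpha_dcg(ranking, judgments, alpha, depth) / ideal


# Toy judgments: "a" and "b" both cover subtopic 1, "c" covers subtopic 2.
judgments = {"a": {1}, "b": {1}, "c": {2}}
diverse = alpha_ndcg(["a", "c", "b"], judgments)
redundant = alpha_ndcg(["a", "b", "c"], judgments)
```

The ranking that covers both subtopics early scores higher than the one that repeats subtopic 1 before reaching subtopic 2.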

In all other respects, the diversity task is identical to the adhoc task. The same 50 topics will be used. The submission format is the same. Query processing must be entirely automatic. You may submit up to three runs, at least one of which will be judged.

Example Topic

    <topic number=0>
        <query> physical therapist </query>
        <description>
            The user requires information regarding the
            profession and the services it provides.
        </description>
        <subtopic number=1> What does a physical therapist do? </subtopic>
        <subtopic number=2> Where can I find a physical therapist? </subtopic>
        <subtopic number=3> How much does physical therapy cost per hour? </subtopic>
        <subtopic number=4>
            What education or training does a physical therapist require?
            Where can I obtain this training?  How long does it take?
        </subtopic>
        <subtopic number=5>
            What is the American Physical Therapy Association?
            What is the URL of its Website?
        </subtopic>
        <subtopic number=6>
            How much do physical therapists earn? What is the starting salary?
            What is the average salary for an experienced therapist?
        </subtopic>
        <subtopic number=7>
            What is the difference between an occupational therapist and a
            physical therapist?
        </subtopic>
        <subtopic number=8>
            Information is required regarding physical therapist's assistants.
            What education do they require? How much do they make?
        </subtopic>
    </topic>

Initial topic release will include only the query field.
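For participants who want to load topics in this layout, the following is a minimal parsing sketch. The attribute values in the example are unquoted, so a strict XML parser may reject the text; the sketch uses regular expressions instead. The function names are hypothetical, and the released topic files may differ in detail from the example shown here.

```python
import re

# Patterns tolerate both number=0 and number="0" (an assumption).
TOPIC_RE = re.compile(r'<topic number="?(\d+)"?\s*>(.*?)</topic>', re.S)
QUERY_RE = re.compile(r'<query>(.*?)</query>', re.S)
DESC_RE = re.compile(r'<description>(.*?)</description>', re.S)
SUBTOPIC_RE = re.compile(
    r'<subtopic number="?(\d+)"?\s*>(.*?)</subtopic>', re.S)


def squash(text):
    """Collapse runs of whitespace, including line breaks, to one space."""
    return " ".join(text.split())


def parse_topics(text):
    """Return one dict per topic found in the text."""
    topics = []
    for number, body in TOPIC_RE.findall(text):
        query = QUERY_RE.search(body)
        desc = DESC_RE.search(body)
        topics.append({
            "number": int(number),
            "query": squash(query.group(1)) if query else "",
            "description": squash(desc.group(1)) if desc else "",
            "subtopics": {int(n): squash(t)
                          for n, t in SUBTOPIC_RE.findall(body)},
        })
    return topics


sample = """
<topic number=0>
    <query> physical therapist </query>
    <description>
        The user requires information regarding the
        profession and the services it provides.
    </description>
    <subtopic number=1> What does a physical therapist do? </subtopic>
    <subtopic number=2> Where can I find a physical therapist? </subtopic>
</topic>
"""

topics = parse_topics(sample)
```

A topic that contains only the query field, as in the initial release, parses with an empty description and no subtopics.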

For the adhoc task, relevance is primarily judged on the basis of the description field, but if a document is relevant to a subtopic then it's usually relevant to the overall topic. However, for the diversity task, a document may not be relevant to any subtopic, even if it is relevant to the overall topic. The set of subtopics is intended to be representative, not exhaustive. We expect each topic to contain 4-10 subtopics.

Submission Format

All runs must be compressed (gzip or bzip2).
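One way to produce a gzip-compressed copy of a run file is sketched below using Python's standard library. The function name and the ".gz" naming convention are assumptions; any tool that produces a standard gzip or bzip2 file should serve equally well.

```python
import gzip
import shutil


def gzip_run(run_path):
    """Write a gzip-compressed copy of run_path next to the original
    and return the path of the compressed file."""
    gz_path = run_path + ".gz"
    with open(run_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```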

For all tracks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

       630 Q0 ZF08-175-870  1 4238 prise1
       630 Q0 ZF08-306-044  2 4223 prise1
       630 Q0 ZF09-477-757  3 4207 prise1
       630 Q0 ZF08-312-422  4 4194 prise1
       630 Q0 ZF08-013-262  5 4189 prise1
          etc.

where:

the first column is the topic number.
the second column is currently unused and should always be "Q0".
the third column is the official document identifier of the retrieved document. For documents in the ClueWeb09 collection this identifier is the value found in the "WARC-TREC-ID" field of the document's WARC header.
the fourth column is the rank at which the document is retrieved.
the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also, run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags.
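The column rules above can be sketched as a small writer and checker. This is an illustrative helper, not an official validation script: the function names, the example run tag, and the choice of a strictly decreasing integer score derived from the rank are assumptions. Strictly decreasing scores avoid ties, so the order computed from the scores matches the submitted ranks.

```python
import re

# Run tags: 12 or fewer letters and digits, no punctuation.
TAG_RE = re.compile(r"^[A-Za-z0-9]{1,12}$")


def format_run(topic, doc_ids, run_tag, max_docs=1000):
    """Return six-column run lines for one topic, best document first."""
    if not TAG_RE.match(run_tag):
        raise ValueError("run tag must be 1-12 letters and digits")
    docs = list(doc_ids)[:max_docs]
    if not docs:
        raise ValueError("each topic needs at least one document")
    lines = []
    for rank, doc_id in enumerate(docs, start=1):
        score = len(docs) - rank + 1  # strictly decreasing, no ties
        lines.append(f"{topic} Q0 {doc_id} {rank} {score} {run_tag}")
    return lines


def check_run_lines(lines):
    """Return a list of problems found; an empty list means none."""
    problems = []
    last_score = {}  # topic -> score on the previous line for that topic
    for number, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            problems.append(f"line {number}: expected 6 columns")
            continue
        topic, q0, _doc_id, rank, score, tag = cols
        if q0 != "Q0":
            problems.append(f"line {number}: second column must be Q0")
        if not rank.isdigit():
            problems.append(f"line {number}: rank is not an integer")
        if not TAG_RE.match(tag):
            problems.append(f"line {number}: bad run tag")
        try:
            value = float(score)
        except ValueError:
            problems.append(f"line {number}: score is not a number")
            continue
        if topic in last_score and value > last_score[topic]:
            problems.append(f"line {number}: score increases")
        last_score[topic] = value
    return problems


# Illustrative identifiers only; real ones come from WARC-TREC-ID.
example = format_run(
    1,
    ["clueweb09-en0000-00-00000", "clueweb09-en0000-00-00001"],
    "uwExample1",
)
```

Running check_run_lines over the lines of a finished run before compressing it catches the most common formatting mistakes, such as a missing column or punctuation in the run tag.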

Last updated: 22-Oct-2009
Date created: 16-Apr-2009
claclarke@plg.uwaterloo.ca