TREC 2010 Web Track Guidelines
Nick Craswell, Microsoft Research
Charles Clarke, University of Waterloo
Ian Soboroff (NIST Contact)
Welcome to the TREC 2010 Web Track. Our goal is to explore and evaluate Web retrieval technologies over the billion-page ClueWeb09 Dataset. This year the track will comprise an adhoc retrieval task, a diversity task, and a new spam filtering task.
We assume you arrived at this page because you're participating in this year's TREC conference. If not, you should start at the TREC main page.
If you're new to the TREC Web Track, you may want to start by reading the overview paper from last year's track. Take note of the size of the collection, roughly 25TB uncompressed. We suggest that you obtain the collection and start working with it as soon as possible. It's a lot of data. You may also want to read the overview papers associated with the older Terabyte Retrieval Track, which ran from 2004-2006, and the previous Web Track, which ran from 1999-2003.
If you're a previous TREC Web Track participant, many of these guidelines will be familiar from last year, but there have been some changes. The spam filtering task is new. For the adhoc and diversity tasks, the topic construction and judging procedures have been modified from last year. We have introduced a six-point scale for adhoc judgments. All judged runs will be fully judged according to both the adhoc and diversity criteria to some minimum depth k ≥ 10.
If you're planning to participate in the track, you should be on the track mailing list. If you're not on the list, send a mail message to listproc (at) nist (dot) gov such that the body consists of the line "subscribe trec-web FirstName LastName".
Older Web Tracks have explored specific aspects of Web retrieval, including named page finding, topic distillation, and traditional adhoc retrieval. In 2009 we introduced a new diversity task that combines aspects of all these older tasks. The goal of this diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. We continue the exploration of this task in 2010.
An analysis of last year's results indicates that the presence of spam and other low-quality pages substantially influenced the overall results. This year we are providing a preliminary spam ranking of the pages in the corpus, as an aid to groups who wish to reduce the number of low-quality pages in their results. An associated spam task requires groups to provide their own ranking of the corpus according to "spamminess".
In addition to the continuation of the diversity task and the introduction of the spam task, we are modifying the traditional assessment process of the adhoc task to incorporate multiple relevance levels, which are similar in structure to the levels used in commercial Web search. This new assessment structure includes a spam/junk level, which will assist in the evaluation of the spam task. The top two levels of the assessment structure are closely related to the homepage finding and topic distillation tasks appearing in older Web Tracks.
The adhoc and diversity tasks will share topics, which will be developed with the assistance of information extracted from the logs of a commercial Web search engine. Topic creation and judging will attempt to reflect a mix of genuine user requirements for the topic. See below for example topics.
The track will again use the ClueWeb09 dataset as its document collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in multiple languages. The dataset was crawled from the Web during January and February 2009.
Further information regarding the collection can be found on the associated Website. Since it can take several weeks to obtain the dataset, we urge you to start this process as soon as you can. The collection will be shipped to you on four 1.5TB hard disks at an expected cost of US$790 plus shipping charges.
If you are unable to work with the full dataset, we will accept runs over the smaller ClueWeb09 "Category B" dataset, but we strongly encourage you to use the full "Category A" dataset if you can. The Category B dataset represents a subset of about 50 million English-language pages. The Category B dataset can be ordered through the ClueWeb09 Website. It will be shipped to you on a single 1.0TB hard disk at an expected cost of US$240 plus shipping charges.
An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. The goal of the task is to return a ranking of the documents in the collection in order of decreasing probability of relevance. The probability of relevance of a document is considered independently of other documents that appear before it in the result list. For each topic, participants submit a ranking of the top 10,000 documents for that topic.
NIST will create and assess 50 new topics for the task, but NIST will not release the full topics to the participants until after runs are submitted. Instead, the initial release of the topics will consist of 50 queries (the topic "titles" in the traditional TREC jargon). No other information regarding the topics will be provided as part of the initial release.
An experimental run consists of the top 10,000 documents for each topic query. The process of executing the queries over the documents and generating the experimental runs should be entirely automatic. There should be no human intervention at any stage, including modifications to your retrieval system motivated by an inspection of the queries. You should not materially modify your retrieval system between the time you download the queries and the time you submit your runs.
At least one run from each group will be judged by NIST assessors. Each document will be judged on a six-point scale, as follows:
This year the primary evaluation measure will be expected reciprocal rank (ERR) as defined by Chapelle et al. (CIKM 2009). In addition to ERR, we will compute and report a range of standard measures, including precision@10 and NDCG@10. Depending on judging resources available, we may also report estimates of mean average precision (MAP) following methods similar to those used last year.
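As an informal illustration of the primary measure, the sketch below computes ERR for a single ranked list, following the formula in Chapelle et al. (CIKM 2009). It assumes that judgments have already been mapped to integer grades from 0 to max_grade. That mapping, the function name, and the default depth of 20 are assumptions made for this example; they are not taken from the official evaluation software.

```python
from typing import Sequence


def expected_reciprocal_rank(
    grades: Sequence[int], max_grade: int = 4, depth: int = 20
) -> float:
    """ERR for one ranked list of integer relevance grades (rank 1 first).

    max_grade and depth are illustrative defaults, not official settings.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:depth], start=1):
        # Probability that a document with this grade satisfies the user.
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err
```

Because the user is assumed to stop at the first satisfying document, a highly relevant page at rank 1 contributes far more than the same page at rank 2.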
You may submit up to three runs for the adhoc task; at least one will be judged. NIST may judge additional runs per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily.
The format for submissions is given in a separate section below. Each topic must have at least one document retrieved for it. While many of the evaluation measures used in this track consider only the top 10-20 documents, methods for estimating MAP sample at deeper levels, and we request that you return the top 10,000 to aid in this process. You may return fewer than 10,000 documents for a topic. However, you cannot hurt your score, and could conceivably improve it, by returning 10,000 documents per topic. All the evaluation measures used in the track count empty ranks as not relevant (Non).
The diversity task is similar to the adhoc retrieval task, but differs in its judging process and evaluation measures. The goal of the diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. For this task, the probability of relevance of a document is conditioned on the documents that appear before it in the result list.
For the purposes of the diversity task, each topic will be structured as a representative set of subtopics, each related to a different user need. Examples are provided below. Documents will be judged with respect to the subtopics. For each subtopic, NIST assessors will make a binary judgment as to whether or not the document satisfies the information need associated with the subtopic.
Topics will be fully defined by NIST in advance of topic release, but only the query field will be initially released. Detailed topics will be released only after runs have been submitted. Subtopics will be based on information extracted from the logs of a commercial search engine, and will be roughly balanced in terms of popularity. Strange and unusual interpretations and aspects will be avoided as much as possible.
Developing and validating metrics for diversity tasks continues to be a goal of the track, and we will report a number of evaluation measures that have been proposed over the past several years. These measures will include an intent aware version of the ERR measure (ERR-IA) proposed by Chapelle et al. (CIKM 2009), the α-nDCG measure proposed by Clarke et al. (SIGIR 2008), and the novelty- and rank-biased precision (NRBP) measure proposed by Clarke et al. (ICTIR 2009). Those papers should be consulted for more information.
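To give a feel for how these measures penalize redundancy, the sketch below computes the unnormalized α-DCG score that underlies α-nDCG, using binary subtopic judgments as described above. The function name, the input representation (one set of subtopic numbers per ranked document), and the defaults α = 0.5 and depth 20 are assumptions for illustration. Normalization by an ideal ranking, which α-nDCG requires, is omitted here.

```python
import math
from typing import Dict, Sequence, Set


def alpha_dcg(
    ranking: Sequence[Set[int]], alpha: float = 0.5, depth: int = 20
) -> float:
    """Unnormalized alpha-DCG for one ranked list.

    ranking[k] holds the subtopic numbers that the document at rank k+1
    was judged relevant to (an empty set means not relevant to any).
    """
    times_covered: Dict[int, int] = {}  # subtopic -> earlier relevant docs
    score = 0.0
    for rank, subtopics in enumerate(ranking[:depth], start=1):
        # Each repeat of an already covered subtopic is worth (1 - alpha)
        # times as much as the previous occurrence.
        gain = sum((1 - alpha) ** times_covered.get(s, 0) for s in subtopics)
        score += gain / math.log2(1 + rank)
        for s in subtopics:
            times_covered[s] = times_covered.get(s, 0) + 1
    return score
```

With this gain function, a second document covering the same subtopic earns less than a document covering a new one, which is the behavior the diversity task is meant to reward.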
In all other respects, the diversity task is identical to the adhoc task. The same 50 topics will be used. Query processing must be entirely automatic. The submission format is the same. The top 10,000 documents should be submitted. You may submit up to three runs, at least one of which will be judged.
The topic structure will be similar to that used for the TREC 2009 topics. The topics below provide examples.
<topic number="6" type="ambiguous">
  <query>kcs</query>
  <description>Find information on the Kansas City Southern railroad.
  </description>
  <subtopic number="1" type="nav">
    Find the homepage for the Kansas City Southern railroad.
  </subtopic>
  <subtopic number="2" type="inf">
    I'm looking for a job with the Kansas City Southern railroad.
  </subtopic>
  <subtopic number="3" type="nav">
    Find the homepage for Kanawha County Schools in West Virginia.
  </subtopic>
  <subtopic number="4" type="nav">
    Find the homepage for the Knox County School system in Tennessee.
  </subtopic>
  <subtopic number="5" type="inf">
    Find information on KCS Energy, Inc., and their merger with
    Petrohawk Energy Corporation.
  </subtopic>
</topic>

<topic number="16" type="faceted">
  <query>arizona game and fish</query>
  <description>I'm looking for information about fishing and hunting
    in Arizona.
  </description>
  <subtopic number="1" type="nav">
    Take me to the Arizona Game and Fish Department homepage.
  </subtopic>
  <subtopic number="2" type="inf">
    What are the regulations for hunting and fishing in Arizona?
  </subtopic>
  <subtopic number="3" type="nav">
    I'm looking for the Arizona Fishing Report site.
  </subtopic>
  <subtopic number="4" type="inf">
    I'd like to find guides and outfitters for hunting trips in Arizona.
  </subtopic>
</topic>
Initial topic release will include only the query field.
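Once the full topics are released, you may want to load them programmatically. The following is a minimal sketch, assuming each topic is available as a well-formed XML fragment shaped like the examples above. The function name and the dictionary layout are our own choices for illustration, and the embedded sample is abbreviated to two subtopics.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of example topic 6, used only to exercise the parser.
EXAMPLE_TOPIC = """<topic number="6" type="ambiguous">
  <query>kcs</query>
  <description>Find information on the Kansas City Southern railroad.
  </description>
  <subtopic number="1" type="nav">
    Find the homepage for the Kansas City Southern railroad.
  </subtopic>
  <subtopic number="2" type="inf">
    I'm looking for a job with the Kansas City Southern railroad.
  </subtopic>
</topic>"""


def parse_topic(xml_text: str) -> dict:
    """Parse one <topic> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "number": int(root.get("number")),
        "type": root.get("type"),  # "ambiguous" or "faceted"
        "query": (root.findtext("query") or "").strip(),
        "description": (root.findtext("description") or "").strip(),
        "subtopics": [
            {
                "number": int(sub.get("number")),
                "type": sub.get("type"),  # "nav" or "inf"
                "text": (sub.text or "").strip(),
            }
            for sub in root.findall("subtopic")
        ],
    }


topic = parse_topic(EXAMPLE_TOPIC)
print(topic["number"], topic["query"], len(topic["subtopics"]))
```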
As shown in these examples, topics are categorized as either "ambiguous" or "faceted". Ambiguous queries are those that have multiple distinct interpretations. We assume that a user interested in one interpretation would not be interested in the others. On the other hand, facets reflect underspecified queries, with different aspects covered by the subtopics. We assume that a user interested in one aspect may still be interested in others.
Each subtopic is categorized as being either navigational ("nav") or informational ("inf"). A navigational subtopic usually has only a small number of relevant pages (often one). For these subtopics, we assume the user is seeking a page with a specific URL, such as an organization's homepage. On the other hand, an informational query may have a large number of relevant pages. For these subtopics, we assume the user is seeking information without regard to its source, provided that the source is reliable.
For the adhoc task, relevance is judged on the basis of the description field. For the diversity task, a document may not be relevant to any subtopic, even if it is relevant to the overall topic. The set of subtopics is intended to be representative, not exhaustive. We expect each topic to contain 4-10 subtopics.
Note: We may be able to obtain probabilities indicating the relative importance of the subtopics. If so, we will include these probabilities in the topics and use them in the computation of the evaluation measures. Otherwise, we will assume subtopics are of equal importance. Further information will be posted on the mailing list in May.
Submission Format for Adhoc and Diversity Tasks
All adhoc and diversity task runs must be compressed (gzip or bzip2).
For both tasks, a submission consists of a single ASCII text file in the format used for most TREC submissions, which we repeat here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
5 Q0 clueweb09-enwp02-06-01125 1 32.38 example2010
5 Q0 clueweb09-en0011-25-31331 2 29.73 example2010
5 Q0 clueweb09-en0006-97-08104 3 21.93 example2010
5 Q0 clueweb09-en0009-82-23589 4 21.34 example2010
5 Q0 clueweb09-en0001-51-20258 5 21.06 example2010
5 Q0 clueweb09-en0002-99-12860 6 13.00 example2010
5 Q0 clueweb09-en0003-08-08637 7 12.87 example2010
5 Q0 clueweb09-en0004-79-18096 8 11.13 example2010
5 Q0 clueweb09-en0008-90-04729 9 10.72 example2010
etc.
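Reading the example above in the usual TREC manner, the six columns are the topic number, the literal Q0, the document identifier, the rank, the score, and the run tag. The sketch below writes lines in that layout and gzips the result. The helper names and the two-decimal score formatting are assumptions for illustration, so check your own output against the example before submitting.

```python
import gzip
from typing import Iterable, List, Tuple


def format_run_line(topic: int, docid: str, rank: int,
                    score: float, run_tag: str) -> str:
    """One line of a run file: six whitespace-separated columns."""
    return f"{topic} Q0 {docid} {rank} {score:.2f} {run_tag}"


def has_six_columns(line: str) -> bool:
    """Check the one hard formatting requirement stated in the guidelines."""
    return len(line.split()) == 6


def build_compressed_run(
    results: Iterable[Tuple[int, str, float]], run_tag: str
) -> bytes:
    """Gzip a run given (topic, docid, score) tuples, best document first.

    Assumes the results for each topic are contiguous and already sorted
    by decreasing score; ranks restart at 1 whenever the topic changes.
    """
    lines: List[str] = []
    current_topic, rank = None, 0
    for topic, docid, score in results:
        rank = rank + 1 if topic == current_topic else 1
        current_topic = topic
        lines.append(format_run_line(topic, docid, rank, score, run_tag))
    return gzip.compress(("\n".join(lines) + "\n").encode("ascii"))
```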
The goal of the spam task is to score each English document in the full ClueWeb09 collection (about 500 million documents in total) according to how likely it is to be spam. For the purposes of this task, we employ a broad definition of spam, which comprises pages that are essentially junk, along with pages that are more obviously deceptive or detrimental. Spam had a major impact on the 2009 TREC submissions; practically every one was improved dramatically by the application of a spam filter.
Participant submissions will be evaluated by how well they identify spam in the adhoc submissions, as measured by area under the receiver operating characteristic curve (AUC). The submission format is exactly that used for the Waterloo Spam Rankings for the ClueWeb09 Dataset. The Waterloo rankings provide a baseline for comparison, and may also be used as a resource by adhoc, diversity and spam task participants.
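For groups who want to check a spam ranking against their own labelled sample, the sketch below computes AUC directly from its pairwise definition. It assumes that lower scores mean spammier documents, as in the percentile scores used for the submission file, and that you have a spam or non-spam label for each scored document. The function name and the quadratic-time pairwise loop are illustrative choices, and the official evaluation may be computed differently.

```python
from typing import Sequence


def spam_auc(scores: Sequence[float], is_spam: Sequence[bool]) -> float:
    """AUC: probability that a random spam page scores below a random
    non-spam page, counting ties as half. Lower score = spammier.
    """
    spam = [s for s, label in zip(scores, is_spam) if label]
    ham = [s for s, label in zip(scores, is_spam) if not label]
    if not spam or not ham:
        raise ValueError("need at least one spam and one non-spam document")
    wins = 0.0
    for s in spam:
        for h in ham:
            if s < h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(spam) * len(ham))
```

A value of 1.0 means every spam page in the sample is ranked as spammier than every non-spam page, while 0.5 is what a random ranking would achieve on average.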
To generate a submission file for the spam task, first create a text file of 503,903,810 lines with the format
percentile-score clueweb09-docid
where each percentile-score is an integer between 0 and 99 inclusive, indicating the percentage of the collection ranked below the specified document. The spammiest 1% of the documents should be given a score of 0, the next 1% a score of 1, and so on, with the least spammy 1% given a score of 99. The ClueWeb09 document identifiers should appear either in alphabetical order, or in corpus order (the two orders differ because the documents with the prefix en-wp are out of order). Each document identifier should appear exactly once. This file will be about 14GB in size.
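The percentile assignment can be sketched as follows, assuming you start from a raw spamminess value per document in which higher means spammier. That input convention, the function names, and the tie-breaking by document identifier are assumptions made for this example. The sketch works in memory, so for the full collection you would need an external sort or a similar out-of-core approach.

```python
from typing import Dict, List


def percentile_scores(spamminess: Dict[str, float]) -> Dict[str, int]:
    """Map raw spamminess (higher = spammier) to integer scores 0..99.

    The spammiest 1% of documents receive 0 and the least spammy 1%
    receive 99. Ties are broken by document identifier.
    """
    ordered = sorted(spamminess, key=lambda d: (-spamminess[d], d))
    n = len(ordered)
    # Position i in spammiest-first order falls in bucket floor(100 * i / n).
    return {docid: (100 * i) // n for i, docid in enumerate(ordered)}


def submission_lines(scores: Dict[str, int]) -> List[str]:
    """Lines of the score file, with identifiers in alphabetical order."""
    return [f"{scores[docid]} {docid}" for docid in sorted(scores)]
```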
The submission must be compressed with a special compression program compress-spamp.c and then bzip2. On a Linux system, this process is effected by the commands:
wget http://durum0.uwaterloo.ca/clueweb09spam/compress-spamp.c
gcc -O2 -o compress-spamp compress-spamp.c
./compress-spamp < scorefile | bzip2 -c > submitfile
The resulting submitfile will be about 350MB in size.
Last updated: 07-Jun-2010
Date created: 29-Apr-2010