Waterloo Spam Rankings for the ClueWeb09 Dataset

Gordon V. Cormack

University of Waterloo

gvcormac@uwaterloo.ca

This site provides four sets of spam scores for the English documents in the ClueWeb09 dataset.

See also: Spam scores for the ClueWeb12 dataset available here.

The ClueWeb09 dataset is a crawl of 1 billion Web pages made available for information retrieval research. One half of the pages are in English. The English subset is the primary dataset used by TREC (the Text Retrieval Conference) in 2009 and subsequent years.

The method by which the spam scores were computed, and its impact on the TREC 2009 results, are described here: Efficient and effective spam filtering and re-ranking for large Web datasets, by Cormack, Smucker and Clarke. The four sets of scores may be downloaded here:

Each file must be decompressed using a special compression/decompression program, which may be downloaded here:

compress-spamp.c The compression program.
decompress-spamp.c The decompression program.

To download and decompress a set of scores (Fusion, for example) on a *nix system:

   wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spamp.c
   gcc -O2 -o decompress-spamp decompress-spamp.c
   wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.Fusion.bz2
   bunzip2 -c clueweb09spam.Fusion.bz2 | ./decompress-spamp > clueweb09spam.Fusion

The file clueweb09spam.Fusion will have 503,903,810 lines with the following format:

   percentile-score clueweb-docid

Note 1: The documents in the spam score files are in alphabetical order. However, the documents in the ClueWeb09 dataset are not, because the "enwp" documents are out of order.
Note 2: The decompressed file is 15GB.

The percentile score indicates the percentage of the documents in the corpus that are "spammier." That is, the spammiest 1% of the documents have percentile-score=0, the next spammiest have percentile-score=1, and so on. The least spammy 1% have percentile-score=99.
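
As an illustration of this scale, the following sketch assigns percentile scores to a small ranked list. It assumes the documents are already ordered from spammiest to least spammy. The helper name percentile_scores, the integer-division formula, and the docids are made up for this example and are not taken from the paper or the distributed programs.

```python
def percentile_scores(ranked_docids):
    """Map each docid to a 0-99 percentile score.

    Assumes ranked_docids is ordered spammiest first. A document at
    0-based rank r out of n gets floor(100 * r / n), i.e. the whole
    percentage of documents that are spammier than it.
    """
    n = len(ranked_docids)
    return {docid: (100 * rank) // n for rank, docid in enumerate(ranked_docids)}


# 200 made-up docids, spammiest first: two documents land in each percentile.
docs = ["doc-%03d" % i for i in range(200)]
scores = percentile_scores(docs)
```

With 200 documents, the two spammiest receive score 0 and the two least spammy receive score 99, matching the description above.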

If you just want the best of the four sets of scores, choose Fusion. If you just want to label pages as spam or not, label those with percentile-score < 70 as spam, and the rest as non-spam. For more details, read the paper.
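
The threshold rule can be applied while streaming a decompressed score file one line at a time, which avoids loading all 503,903,810 lines into memory. The sketch below is one possible way to do this. The function names and the sample docids are hypothetical, and it assumes each line holds an integer score and a docid separated by whitespace.

```python
SPAM_THRESHOLD = 70  # percentile-score below this value is labelled spam


def parse_line(line):
    """Split one 'percentile-score clueweb-docid' line into (int, str)."""
    score, docid = line.split()
    return int(score), docid


def is_spam(score, threshold=SPAM_THRESHOLD):
    """Apply the rule from the text: spam if percentile-score < threshold."""
    return score < threshold


def nonspam_docids(lines, threshold=SPAM_THRESHOLD):
    """Yield docids labelled non-spam, reading the input lazily."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        score, docid = parse_line(line)
        if not is_spam(score, threshold):
            yield docid


# Made-up sample lines in the documented format.
sample = [
    "12 clueweb09-en0000-00-00000\n",
    "70 clueweb09-en0000-00-00001\n",
    "99 clueweb09-en0000-00-00002\n",
]
kept = list(nonspam_docids(sample))
```

To run this over the real data, pass an open file object for clueweb09spam.Fusion in place of the sample list. File objects iterate line by line, so memory use stays small.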

Log-Odds Version

Users who want to use the scores in a probability model may prefer the log-odds version of the Fusion score set, which requires a different special-purpose compression/decompression program:

Download and decompress like this:

   wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spam.c
   gcc -O2 -o decompress-spam decompress-spam.c
   wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.FusionLogOdds.bz2
   bunzip2 -c clueweb09spam.FusionLogOdds.bz2 | ./decompress-spam > clueweb09spam.FusionLogOdds

The decompressed file clueweb09spam.FusionLogOdds (16GB) will have 503,903,810 lines with the following format:

   log-odds-score clueweb-docid

The log-odds-score is a floating-point number such that

                         Prob(page is spam)
   log-odds-score = log ---------------------
                        Prob(page is nonspam)
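
Inverting this definition gives Prob(page is spam) as the logistic function of the score. The sketch below assumes the logarithm is the natural logarithm, which this page does not state. If the scores use a different base, the conversion needs to be adjusted. The function name spam_probability is made up for this example.

```python
import math


def spam_probability(log_odds):
    """Convert a log-odds score to Prob(page is spam).

    Assumes natural-log odds: p = 1 / (1 + exp(-log_odds)).
    The two branches avoid overflow in exp() for large magnitudes.
    """
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Under this assumption, a score of 0 corresponds to even odds (probability 0.5), positive scores indicate that a page is more likely spam than not, and negative scores indicate the opposite.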