Waterloo Spam Rankings for the ClueWeb09 Dataset

Gordon V. Cormack

University of Waterloo


This site provides four sets of spam scores for the English documents in the ClueWeb09 dataset.

See also: Spam scores for the ClueWeb12 dataset available here.

The ClueWeb09 Dataset is a crawl of 1 billion Web pages, available for information retrieval research. About half of the pages are in English. The English subset is the primary dataset used by TREC (the Text Retrieval Conference) in 2009 and subsequent years.

The method used to compute the spam scores, and its impact on the TREC 2009 results, are described in "Efficient and effective spam filtering and re-ranking for large Web datasets," by Cormack, Smucker and Clarke. The four sets of scores may be downloaded here:

Each file must be decompressed using a special-purpose decompression program, decompress-spamp.c, which is fetched in the first step below.

To download and decompress a set of scores (Fusion, for example) on a *nix system:
   wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spamp.c
   gcc -O2 -o decompress-spamp decompress-spamp.c
   wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.Fusion.bz2
   bunzip2 -c clueweb09spam.Fusion.bz2 | ./decompress-spamp > clueweb09spam.Fusion
The file clueweb09spam.Fusion will have 503,903,810 lines with the following format:
   percentile-score clueweb-docid
Note 1: The documents in the spam score files are in alphabetical order by docid. The documents in the ClueWeb09 dataset itself are not, because the "enwp" documents are out of order.
Note 2: The decompressed file is 15GB.

The percentile score indicates the percentage of the documents in the corpus that are "spammier." That is, the spammiest 1% of the documents have percentile-score=0, the next spammiest have percentile-score=1, and so on. The least spammy 1% have percentile-score=99.
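As an illustration of this definition only (not the procedure used to produce the released files), the sketch below assigns percentile scores to a small made-up collection. The function name percentile_scores, the raw "spamminess" values, and the docids are all hypothetical; ties are broken arbitrarily by sort order.

```python
def percentile_scores(spamminess):
    """Map each docid to the percentage (0-99) of documents that are spammier."""
    n = len(spamminess)
    # Rank documents from spammiest (rank 0) to least spammy (rank n - 1).
    order = sorted(spamminess, key=spamminess.get, reverse=True)
    # rank / n is the fraction of documents that are spammier than this one.
    return {doc: (100 * rank) // n for rank, doc in enumerate(order)}


# Hypothetical raw values for 200 made-up documents (higher means spammier).
raw = {f"doc{i:03d}": float(i) for i in range(200)}
scores = percentile_scores(raw)
print(scores["doc199"], scores["doc000"])  # prints: 0 99
```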

If you just want the best of the four sets of scores, choose Fusion. If you just want to label pages as spam or not, label those with percentile-score<70 to be spam, and the rest non-spam. For more details, read the paper.
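The threshold rule above might be applied with a short script such as the following. This is a minimal sketch that assumes each line has exactly the documented "percentile-score clueweb-docid" form separated by whitespace; the function name label_lines and the sample docids are made up for illustration.

```python
import io

SPAM_THRESHOLD = 70  # from the text: percentile-score < 70 is labelled spam


def label_lines(lines, threshold=SPAM_THRESHOLD):
    """Yield (docid, is_spam) for each 'percentile-score clueweb-docid' line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        score, docid = int(fields[0]), fields[1]
        yield docid, score < threshold


# Made-up sample in the documented format; these docids are illustrative only.
sample = io.StringIO(
    "12 clueweb09-en0000-00-00000\n"
    "70 clueweb09-en0000-00-00001\n"
    "99 clueweb09-en0000-00-00002\n"
)
labels = dict(label_lines(sample))
```

For the real data, passing an open file object (for example, open("clueweb09spam.Fusion")) in place of the sample would process the file line by line, which avoids holding the 15GB file in memory.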

Log-Odds Version

Users who want to use the scores in a probability model may prefer the log-odds version of the Fusion score set, which requires a different special-purpose decompression program, decompress-spam.c.

Download and decompress like this:
   wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spam.c
   gcc -O2 -o decompress-spam decompress-spam.c
   wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.FusionLogOdds.bz2
   bunzip2 -c clueweb09spam.FusionLogOdds.bz2 | ./decompress-spam > clueweb09spam.FusionLogOdds
The decompressed file clueweb09spam.FusionLogOdds (16GB) will have 503,903,810 lines with the following format:
   log-odds-score clueweb-docid
The log-odds-score is a floating-point number such that
                         Prob(page is spam)
   log-odds-score = log ---------------------
                        Prob(page is nonspam)
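
Inverting the formula above gives Prob(page is spam) = 1 / (1 + exp(-log-odds-score)). The sketch below applies that inversion; it assumes "log" denotes the natural logarithm, which the text does not state explicitly. The function names and the sample line are hypothetical.

```python
import math


def spam_probability(log_odds):
    """Convert a log-odds score to Prob(page is spam), assuming natural log."""
    # Use the numerically stable form of the logistic function so that
    # large positive or negative scores do not overflow math.exp.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def parse_log_odds_line(line):
    """Split a 'log-odds-score clueweb-docid' line into (docid, probability)."""
    score_text, docid = line.split()
    return docid, spam_probability(float(score_text))


# Made-up example line; the docid and score are illustrative only.
docid, prob = parse_log_odds_line("0.0 clueweb09-en0000-00-00000")
print(docid, prob)  # prints: clueweb09-en0000-00-00000 0.5
```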