See also: Spam scores for the ClueWeb12 dataset are available here.
The ClueWeb09 Dataset is a crawl of 1 billion Web pages available for information retrieval research. One half of the pages are in English. The English subset is the primary dataset used by TREC (the Text Retrieval Conference) in 2009 and subsequent years.
The method used to compute the spam scores and its impact on the TREC 2009 results are described here: Efficient and effective spam filtering and re-ranking for large Web datasets, by Cormack, Smucker and Clarke. The four sets of scores may be downloaded here:
    wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spamp.c
    gcc -O2 -o decompress-spamp decompress-spamp.c
    wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.Fusion.bz2
    bunzip2 -c clueweb09spam.Fusion.bz2 | ./decompress-spamp > clueweb09spam.Fusion

The file clueweb09spam.Fusion will have 503,903,810 lines with the following format:

    percentile-score clueweb-docid

Note 1: The documents in the spam score files are in alphabetical order. However, the documents in the clueweb09 dataset are not, because the "enwp" documents are out of order.
The percentile score indicates the percentage of the documents in the corpus that are "spammier." That is, the spammiest 1% of the documents have percentile-score=0, the next spammiest have percentile-score=1, and so on. The least spammy 1% have percentile-score=99.
If you just want the best of the four sets of scores, choose Fusion. If you just want to label pages as spam or not, label those with percentile-score<70 to be spam, and the rest non-spam. For more details, read the paper.
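The labeling rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the distributed tools: it assumes each line holds an integer percentile score and a docid separated by whitespace, and the names `parse_line`, `label_pages`, and `SPAM_THRESHOLD` are hypothetical. The sample docids are made up for the demonstration.

```python
import io

# Threshold taken from the text: percentile-score < 70 is labeled spam.
SPAM_THRESHOLD = 70


def parse_line(line):
    """Split one 'percentile-score clueweb-docid' line into (int, str)."""
    score_text, docid = line.split()
    return int(score_text), docid


def label_pages(lines, threshold=SPAM_THRESHOLD):
    """Yield (docid, is_spam) for each non-empty input line."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        score, docid = parse_line(line)
        yield docid, score < threshold


if __name__ == "__main__":
    # Made-up sample lines; these are not real scores from the files.
    sample = io.StringIO(
        "12 clueweb09-en0000-00-00000\n"
        "70 clueweb09-en0000-00-00001\n"
        "99 clueweb09-en0000-00-00002\n"
    )
    for docid, is_spam in label_pages(sample):
        print(docid, "spam" if is_spam else "non-spam")
```

Because `label_pages` is a generator, it can be fed an open file handle for clueweb09spam.Fusion and will process the file one line at a time rather than loading all 503,903,810 lines into memory.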
    wget http://durum0.uwaterloo.ca/clueweb09spam/decompress-spam.c
    gcc -O2 -o decompress-spam decompress-spam.c
    wget http://durum0.uwaterloo.ca/clueweb09spam/clueweb09spam.FusionLogOdds.bz2
    bunzip2 -c clueweb09spam.FusionLogOdds.bz2 | ./decompress-spam > clueweb09spam.FusionLogOdds

The decompressed file clueweb09spam.FusionLogOdds (16GB) will have 503,903,810 lines with the following format:

    log-odds-score clueweb-docid

The log-odds-score is a floating-point number such that

    log-odds-score = log( Prob(page is spam) / Prob(page is nonspam) )
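If a probability is more convenient than a log-odds value, the definition above can be inverted. Assuming Prob(page is nonspam) = 1 - Prob(page is spam) and that "log" means the natural logarithm (the page does not state the base, so check the paper), solving for the probability gives Prob(page is spam) = 1 / (1 + exp(-score)). The helper name `spam_probability` below is hypothetical and the code is only a sketch under those assumptions.

```python
import math


def spam_probability(log_odds):
    """Invert log_odds = log(p / (1 - p)), assuming a natural logarithm."""
    # Branch on the sign so that math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


if __name__ == "__main__":
    # A score of 0 means spam and nonspam are equally likely.
    print(spam_probability(0.0))
```

Under these assumptions, a positive score means the page is more likely spam than not, and a negative score means the opposite.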