TREC 2005 Spam Track Public Corpus

Copyright 2005 Gordon V. Cormack and Thomas R. Lynam

    gvcormac@uwaterloo.ca
    trlynam@uwaterloo.ca

Includes material released to the public domain and material
used with permission.

Permission is granted for research use provided
users agree to, and abide by, the Usage Agreement.

Permission is NOT granted to publish this corpus
or any material portion (including file names and
judgements).

It is our intention to make this corpus available
to non-participants, on request, at a future date.


----

INSTRUCTIONS

0. WARNING! This corpus contains viruses, fraudulent solicitations,
   and other files that may pose a security risk.  Do not view any
   files in the folder named data with an ordinary browser or
   email client.  Also note that virus or adware removal tools may 
   damage the corpus.

1. The compressed file may be uncompressed with gzip, WinZip,
   or any other utility that understands the gzip format.

2. The compressed file will unpack to a folder named trec05p-1.

3. There is one main corpus with four subsets:

   trec05p-1/full   -- the main corpus with 92,189 messages
   trec05p-1/ham25  -- subset of full: 100% of spam, 25% of ham
   trec05p-1/ham50  -- subset of full: 100% of spam, 50% of ham
   trec05p-1/spam25 -- subset of full: 25% of spam, 100% of ham
   trec05p-1/spam50 -- subset of full: 50% of spam, 100% of ham

4. The corpus is compatible with the "TREC Spam Filter Evaluation
   Toolkit" and can be run with the commands:

      run.sh trec05p-1/full/
      run.sh trec05p-1/ham25/
      run.sh trec05p-1/ham50/
      run.sh trec05p-1/spam25/
      run.sh trec05p-1/spam50/
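
The five commands in step 4 can also be issued from a loop. The
following is a minimal sketch, not part of the toolkit. The names
CORPUS_ROOT, RUN_CMD, and run_all_subsets are hypothetical and are
introduced here only for illustration. It assumes the toolkit's run.sh
is on the PATH and that the corpus was unpacked in the current
directory.

```shell
#!/bin/sh
# Sketch only: CORPUS_ROOT, RUN_CMD and run_all_subsets are
# illustrative names, not part of the TREC toolkit.
CORPUS_ROOT="${CORPUS_ROOT:-trec05p-1}"   # folder created in step 2
RUN_CMD="${RUN_CMD:-run.sh}"              # toolkit script from step 4

run_all_subsets() {
    # Subset names are taken from step 3 of the instructions.
    for subset in full ham25 ham50 spam25 spam50; do
        dir="$CORPUS_ROOT/$subset/"
        if [ ! -d "$dir" ]; then
            # Report missing subsets on stderr and carry on.
            echo "skipping $dir (not found)" >&2
            continue
        fi
        # Stop at the first run that fails.
        "$RUN_CMD" "$dir" || return 1
    done
}
```

Calling run_all_subsets with no arguments runs each subset in the order
listed in step 3. Setting RUN_CMD=echo beforehand prints the subset
paths instead of running the toolkit, which is a quick way to check
that the folders are where the script expects them.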