Date: Sat, 19 May 2007 09:24:07 -0400
From: "Gordon V. Cormack" <gvcormac@uwaterloo.ca>
To: trecspam@nist.gov
Subject: TREC 2007 Spam Track

We are still in the process of finalizing the TREC 2007 Spam Track
guidelines. This is an interim report. We should have the final
guidelines by the end of June. In the meantime, please feel free to
post queries or comments to this list.

The primary tasks will use the same toolkit and data format as last
year; i.e. the TREC spam filter evaluation toolkit available here:

    http://plg.uwaterloo.ca/~gvcormac/spam/

The active learning task will use a different version of the toolkit,
to be available shortly.

The three tasks are:

On-line filtering with immediate feedback.
  - exactly the same task as for TREC 2005 and 2006

On-line filtering with delayed/incomplete/noisy feedback.
  - same tools as for the TREC 2006 delayed feedback task, but the
    test data may not contain "train" commands for every message, and
    some of the "train" commands may be wrong, so as to simulate user
    underreporting and user error. Emphasis will be placed on correct
    classification of a large number of messages with no feedback.
    (For example, there may be no feedback for the last half of the
    messages.)

On-line filtering with active learning.
  - different tools and task from TREC 2006 active learning

    Filters will perform on-line classification as for the other two
    tasks, but will be allowed to query the true class (ham or spam)
    of a fraction of the messages, chosen by the filter. This will be
    effected by an additional command "query" added to the toolkit:

        query <message>

    where <message> is a message previously classified by the filter.

Filters will be submitted to NIST (date to be determined, but probably
mid-July), after which a public corpus will be released. Results on the
public corpus will also be submitted to NIST (probably late August).

-- 
Gordon V. Cormack
CS Dept, University of Waterloo, Canada N2L 3G1
gvcormack@uwaterloo.ca http://cormack.uwaterloo.ca/cormack
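
For participants who want to try the delayed/incomplete/noisy feedback
condition on their own data before the final guidelines appear, the
behaviour described above might be approximated as follows. This is
only a rough sketch, not part of the toolkit: the function name, the
parameters (cutoff_fraction, drop_rate, error_rate) and their default
values are all hypothetical, and the actual proportions used by the
track are not stated in this message. A feedback value of None stands
for a message with no "train" command.

```python
import random
from typing import List, Optional


def simulate_feedback(
    gold: List[str],
    cutoff_fraction: float = 0.5,  # hypothetical: feedback stops after this share
    drop_rate: float = 0.2,        # hypothetical: user underreporting
    error_rate: float = 0.05,      # hypothetical: user error
    seed: int = 0,
) -> List[Optional[str]]:
    """Return one feedback label per message: "ham", "spam", or None.

    None means the filter never receives a "train" command for that
    message. All rates are illustrative assumptions only.
    """
    rng = random.Random(seed)
    cutoff = int(len(gold) * cutoff_fraction)
    feedback: List[Optional[str]] = []
    for i, label in enumerate(gold):
        if i >= cutoff or rng.random() < drop_rate:
            # Incomplete feedback: no "train" command for this message.
            feedback.append(None)
        elif rng.random() < error_rate:
            # Noisy feedback: the reported class is the wrong one.
            feedback.append("ham" if label == "spam" else "spam")
        else:
            feedback.append(label)
    return feedback


if __name__ == "__main__":
    gold = ["spam", "ham"] * 10
    for truth, reported in zip(gold, simulate_feedback(gold)):
        print(truth, reported)
```

With the default cutoff_fraction of 0.5, no feedback is given for the
last half of the messages, matching the example in the task
description; the first half is thinned and corrupted at random.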