TREC Legal Discovery Track Final Guidelines for 2006
April 15, 2006

Revision history:
Mar 30: Original
Apr 4: Added judgment of top 10 for all runs, fixed a typo
Apr 15: Delayed release of documents to May 1, changed distribution info

Introduction

The goal of the TREC-2006 legal discovery track is to evaluate the effectiveness of search technologies under conditions similar to those faced in legal discovery contexts. Increasingly, lawyers use automated search and retrieval tools to sort through the vast amount of evidence in electronic form that may be relevant to issues arising in litigation. While the legal community is familiar with conducting simple term-based searches against large data sets, the efficacy of varying search methodologies and approaches in finding responsive documents in legal settings has not been generally studied. The TREC legal discovery track provides a forum to evaluate the results of searches on "topics" approximating real legal requests in litigation. To this end, a set of hypothetical complaints and requests for the production of documents has been created. The documents to be searched have been drawn from those released under the tobacco Master Settlement Agreement between several States and leading tobacco companies. Assessments of relevance will be made by lawyers consistent with the rules governing civil discovery and the standard for admissibility of legal evidence in court. The TREC-2006 legal discovery track is the first step in a research program we believe will be of enormous value to the legal profession as a whole, in aiding the "just, speedy, and inexpensive" determination of every case in court (the standard set forth in Rule 1 of the Federal Rules of Civil Procedure).

How to Participate

1) Register as a TREC participant (see http://trec.nist.gov/call06.html). Late registration can probably be accommodated.
2) Join the mailing list by sending an email to oard (at) umd.edu. The mailing list archives are available on the track's Web site: http://trec-legal.umiacs.umd.edu/
3) Order the document collection (available May 1).
4) Obtain training production requests and example relevance judgments from the track Web site (available May 1).
5) Obtain evaluation production requests from the track Web site (available July 1, but do not look at them until your system is completely ready if you plan to do automatic runs).
6) Generate ranked lists for each evaluation production request using one or more techniques.
7) Submit one or more sets of ranked lists. We will accept a maximum of 8 sets for official scoring. (due August 1)
8) Score any additional runs locally using relevance judgments provided by the organizers. (available by October 1, and maybe sooner)
9) Write a working notes paper.
10) Attend the TREC-2006 conference.
11) Revise your working notes paper for the final proceedings.

Data Provided by the Organizers

1) Documents. The set of documents for the track will be the IIT Complex Document Information Processing (CDIP) test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco (UCSF). These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The documents were released in scanned form (or in paper form, and then were scanned), so the original source files are not available. Optical character recognition (OCR) was applied by UCSF to all documents in an attempt to produce retrievable text for every document. The quality of the OCR varies widely, and some documents have no OCR data. However, all documents also have metadata records produced by the tobacco companies. We have obtained from UCSF a complete copy of the documents, both the OCR+metadata records and the TIFF images (which will not be directly used in the track). Reformatting and validation were performed at the Illinois Institute of Technology, David D. Lewis Consulting, and the University of Maryland. The metadata and OCR can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail as a set of DVDs from NIST. Details on procedures for obtaining the collection will be added to these guidelines as soon as they are finalized (no later than May 1).

2) Production requests. Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems. Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g., a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate. Queries that are formed completely automatically, using software that existed at the time the evaluation topics were first seen, are considered automatic; all other cases are considered manual queries. Automatic queries provide a reasonably well controlled basis for cross-system comparisons, although they are typically representative of only the first query in an interactive search process. The most common use of manual queries is to demonstrate the retrieval effectiveness that can be obtained after interactive optimization of the query (which typically yields excellent contributions to the judgment pools and is thus highly desirable), but even interventions as simple as manual removal of stopwords or stop structure will result in manual queries. A set of training topics with a few relevance judgments (more as an example of format than to provide meaningful training data) will also be made available in April 2006.
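As an illustration of using the Boolean query in a non-Boolean way, the sketch below flattens a negotiated Boolean query into a bag of terms that could be fed to a ranked retrieval system as an automatic query. The operator list and the sample query syntax are assumptions made for illustration; consult the released topics for the actual format.

import re

# Words that carry Boolean/proximity structure rather than content.
# This operator list is an assumption for illustration only.
OPERATORS = {"AND", "OR", "NOT", "W/5", "W/10"}

def boolean_to_keywords(boolean_query: str) -> str:
    """Flatten a negotiated Boolean query into a bag-of-words query
    suitable for a ranked retrieval system."""
    # Strip parentheses and quotation marks; keep wildcard/stem markers as-is.
    tokens = re.findall(r'[A-Za-z0-9!*/]+', boolean_query)
    keywords = [t for t in tokens if t.upper() not in OPERATORS]
    # Remove duplicates while preserving order.
    seen = set()
    return " ".join(t for t in keywords
                    if not (t.lower() in seen or seen.add(t.lower())))

# Hypothetical example (not an actual track topic):
print(boolean_to_keywords('("low tar" OR light*) AND (advertis! W/5 campaign)'))
# -> low tar light* advertis! campaign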
3) Relevance judgments. As usual in TREC, there are two somewhat competing goals for the evaluation:

a. Measure the effectiveness and understand the behavior of the technologies applied by the participants in this year's evaluation.
b. Identify and record as many of the documents relevant to each topic as possible. The goal here is to allow future users of the collection to approximately measure the effectiveness of new technologies under the assumption that all documents relevant to a topic are known.

We will pursue these two goals by having experts judge the top ranked documents retrieved by each system for each topic, as well as by using various manual and statistical sampling procedures. Binary (yes/no) relevance judgments will be created in a distributed fashion using a web-based platform that will give judges the ability to see the TIFF images for the documents. At least the top 10 documents will be judged for each run; some runs will be selected for deeper assessment based on the sampling strategies necessary to support comparison with Boolean retrieval and (for ranked retrieval) on the judgment priorities suggested by participating teams at the time of submission.

Data to be Submitted to the Organizers

Participating sites are invited to submit results from up to 8 runs for official scoring. All submitted runs will be scored and reported, and a comparative analysis will be reported for at least the following standard conditions:

1. Full document set, including metadata, automatic queries
2. Good-OCR subset, no use of metadata, automatic queries
3. Full document set, including metadata, manual queries from any source

Participating sites are not strictly required to run those standard conditions, but in the interest of comparative evaluation they are asked to submit results for as many variants as possible. Participants should send one ranked list and one run description for each run.

1) Ranked list (formatted system output)

For each run, the top 5000 results will be accepted in the format:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0 - unused field (the literal 'Q0')
docno - document identifier taken from the DOCNO field of the document
rank - rank assigned to the document (1=highest)
score - numeric score that is non-increasing with rank. It is the score, not the rank, that is used by the evaluation software (see the "TREC 2006: Welcome to TREC" message from Ellen Voorhees for more details).
tag - unique identifier for this run (the same for every topic and document)

Participating sites should each adopt some standard convention for tags that begins with a unique identifier for the site and ends with a unique run identifier. Tags are limited to 12 letters or numbers with no punctuation. For example, the University of Ypsilanti might create the tag UyFuMdBo1 for a run with the full collection, including metadata, and queries derived from only the Boolean query field.
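To make the expected layout concrete, here is a minimal sketch of a script that writes a ranked list in the six-column format above and enforces the top-5000 cutoff and the tag constraints. The file name, topic identifier, and document identifiers in the example are hypothetical.

import re

MAX_RESULTS_PER_TOPIC = 5000  # only the top 5000 results per topic are accepted

def write_run(path, results, tag):
    """Write a run file in the required format:
       topicid Q0 docno rank score tag
    `results` maps each topic id to a list of (docno, score) pairs,
    already sorted by decreasing score."""
    if not re.fullmatch(r'[A-Za-z0-9]{1,12}', tag):
        raise ValueError("tag must be 1-12 letters or digits, no punctuation")
    with open(path, "w") as out:
        for topicid, ranked in results.items():
            for rank, (docno, score) in enumerate(ranked[:MAX_RESULTS_PER_TOPIC], start=1):
                out.write(f"{topicid} Q0 {docno} {rank} {score:.4f} {tag}\n")

# Hypothetical example (topic and document identifiers are made up):
write_run("UyFuMdBo1.txt",
          {"1": [("aaa00a00", 14.2), ("bbb11b11", 13.7)]},
          "UyFuMdBo1")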
(list "Manual" for runs with any human intervention, including any changes to the system implementation after reading the evaluation topics, any manual editing of the contents of the topic fields that are used, and any adjustments made to system output based on human examination of the system results). Bug fixes to the system after examining the topics result in manual runs. A description of how the queries was formed. Please at least say which fields were used (complaint, production request, and/or Boolean). Evaluation Measures The principal measure of effectiveness will be mean precision at B, where B is the number of documents retrieved by a negotiated Boolean search (B will be different for each production request). Pooled relevance judgments will be used to establish a lower bound on the ranked runs and random sampling will be used to obtain an unbiased estimate of the precision at B for the baseline Boolean run. Participating teams are encouraged to develop and report additional measures, including those that focus on differences rather than means across topics and documents. Schedule (all dates 2006) May 1 Collection release with training topics Jul 1 Evaluation topic release Aug 1 Runs submitted by sites Oct 1 Results release mid-Oct Working notes papers due(exact date TBA) Nov 14-17 TREC-2006, Gaithersburg, MD For Additional Information The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators: Jason R. Baron jason.baron (at) nara.gov David D. Lewis davelewis (at) daviddlewis.com Douglas W. Oard oard (at) umd.edu