<h2>Engineering Quality and Reliability in Technology-Assisted Review</h2>

<h3>Gordon V. Cormack &amp; Maura R. Grossman</h3>

<h3><i>Presented at <a href="http://sigir.org/sigir2016/">SIGIR 2016</a></i></h3>

<p><a href="cormackgrossman16.pdf">Download authors&rsquo; copy</a></p>

<p>DOI (<i>permanent link</i>): <a href="http://dx.doi.org/10.1145/2911451.2911510">http://dx.doi.org/10.1145/2911451.2911510</a></p>


<h3>Abstract</h3>

<p>The objective of technology-assisted review (&ldquo;TAR&rdquo;) is to find as much relevant information as possible with reasonable effort. Quality is a measure of the extent to which a TAR method achieves this objective, while reliability is a measure of how consistently it achieves an acceptable result. We are concerned with how to define, measure, and achieve high quality and high reliability in TAR. When quality is defined using the traditional goal-post method of specifying a minimum acceptable recall threshold, the quality and reliability of a TAR method are both, by definition, equal to the probability of achieving the threshold. Assuming this definition of quality and reliability, we show how to augment any TAR method to achieve guaranteed reliability, for a quantifiable level of additional review effort. We demonstrate this result by augmenting the TAR method supplied as the baseline model implementation for the TREC 2015 Total Recall Track, measuring reliability and effort for 555 topics from eight test collections. While our empirical results corroborate our claim of guaranteed reliability, we observe that the augmentation strategy may entail disproportionate effort, especially when the number of relevant documents is low. To address this limitation, we propose stopping criteria for the model implementation that may be applied with no additional review effort, while achieving empirical reliability that compares favorably to the provably reliable method. We further argue that optimizing reliability according to the traditional goal-post method is inconsistent with certain subjective aspects of quality, and that optimizing a Taguchi quality loss function may be more apt.</p>
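<p>As a rough illustration of the contrast drawn in the abstract, the sketch below scores two sets of review runs in two ways: goal-post reliability (the fraction of runs whose recall meets a threshold) and a mean quadratic loss. The recall values, the function names, and the specific loss form <code>k * (target - recall)**2</code> are all assumptions made for this example. They are not taken from the paper, which may define its loss function differently.</p>

```python
def goal_post_reliability(recalls, threshold):
    """Fraction of review runs whose recall meets or exceeds the threshold."""
    if not recalls:
        raise ValueError("need at least one recall value")
    met = sum(1 for r in recalls if r >= threshold)
    return met / len(recalls)


def mean_quadratic_loss(recalls, target=1.0, k=1.0):
    """Mean of k * (target - recall)**2 over all runs.

    Assumed Taguchi-style quadratic form, chosen for illustration only.
    """
    if not recalls:
        raise ValueError("need at least one recall value")
    return sum(k * (target - r) ** 2 for r in recalls) / len(recalls)


# Hypothetical recall values, invented for this example.
method_a = [0.71, 0.72, 0.70, 0.71]  # every run barely clears 0.70
method_b = [0.69, 0.95, 0.96, 0.97]  # one narrow miss, otherwise far higher

for name, runs in (("A", method_a), ("B", method_b)):
    reliability = goal_post_reliability(runs, 0.70)
    loss = mean_quadratic_loss(runs)
    print(f"method {name}: reliability={reliability:.2f}, mean loss={loss:.4f}")
```

<p>With these made-up numbers, method A is perfectly reliable under the goal-post definition while method B is not, yet method B has the lower mean loss because most of its runs come much closer to full recall. A pass/fail threshold cannot register that difference, whereas a continuous loss function can.</p>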