Impact of Review-Set Selection on Human Assessment for Text Classification
Adam Roegiest & Gordon V. Cormack
DOI (permanent link): http://dx.doi.org/10.1145/2911451.2911510
Abstract
In a laboratory study, human assessors were significantly more likely to judge the same documents as relevant when they were presented for assessment within the context of documents selected using random or uncertainty sampling, as compared to relevance sampling. The effect is substantial and significant [0.54 vs. 0.42, p<0.0002] across a population of documents including both relevant and non-relevant documents, for several definitions of ground truth. This result is in accord with Smucker and Jethani's SIGIR 2010 finding that documents were more likely to be judged relevant when assessed within low-precision versus high-precision ranked lists. Our study supports the notion that relevance is malleable, and that one should take care in assuming any labeling to be ground truth, whether for training, tuning, or evaluating text classifiers.
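For readers unfamiliar with the three selection strategies named above, the sketch below illustrates how each might choose a review set from classifier scores. It is a minimal illustration, not the study's apparatus: the function name, the toy scoring setup, and the batch size are all hypothetical.

```python
# Hypothetical sketch of the three review-set selection strategies.
import random

def select_batch(doc_scores, k, strategy):
    """Pick k documents to route to human review.

    doc_scores: dict mapping doc_id -> classifier P(relevant) in [0, 1].
    strategy:   'random', 'uncertainty', or 'relevance'.
    """
    ids = list(doc_scores)
    if strategy == "random":
        # Random sampling: every unjudged document is equally likely.
        return random.sample(ids, k)
    if strategy == "uncertainty":
        # Uncertainty sampling: documents whose score is closest to 0.5,
        # i.e. those about which the classifier is least certain.
        return sorted(ids, key=lambda d: abs(doc_scores[d] - 0.5))[:k]
    if strategy == "relevance":
        # Relevance sampling: documents scored most likely relevant,
        # yielding a high-precision review set.
        return sorted(ids, key=lambda d: doc_scores[d], reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    random.seed(0)
    scores = {f"doc{i}": random.random() for i in range(100)}
    for s in ("random", "uncertainty", "relevance"):
        print(s, select_batch(scores, 5, s))
```

Under this framing, the abstract's finding is that the same document tends to draw different human judgments depending on which of these strategies assembled the batch it appears in.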