Impact of Review-Set Selection on Human Assessment for Text Classification

Adam Roegiest & Gordon V. Cormack

Presented at SIGIR 2016


DOI (permanent link):


In a laboratory study, human assessors were significantly more likely to judge the same documents as relevant when they were presented for assessment within the context of documents selected using random or uncertainty sampling, as compared to relevance sampling. The effect is substantial and significant [0.54 vs. 0.42, p<0.0002] across a population of documents including both relevant and non-relevant documents, for several definitions of ground truth. This result is in accord with Smucker and Jethani's SIGIR 2010 finding that documents were more likely to be judged relevant when assessed within low-precision versus high-precision ranked lists. Our study supports the notion that relevance is malleable, and that one should take care in assuming any labeling to be ground truth, whether for training, tuning, or evaluating text classifiers.