Supervised Learning from Multiple Experts: Whom to trust when everyone lies a bit (2009)

Authors

Abstract

We describe a probabilistic approach for supervised learning when we have multiple experts/annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method clearly beats the commonly used majority voting baseline.

Discussion

Bob Carpenter, 2009/06/16 17:00

I'm particularly interested in the effect of mixing in classifier training and then treating it on a par with other annotators. I've thought about doing this, but I'm not convinced it makes sense conceptually to have the classifier influence ground truth estimation. Even though the annotation model corrects for individual annotator bias (or in this case, the logistic regression classifier's bias), each annotator still affects the overall model through its bias-adjusted vote (if it didn't, you couldn't get off the ground at all). If you use the inferred standard including the trained classifier, the classifier should perform better because it's getting a vote on the truth! So the real question is whether it'll learn the right models. For that, I'd think we'd need some kind of held-out eval, but that begs the question on inferring the gold standard. The gold standards behind Snow et al.'s work weren't that pure after all (I have some commentary on discrepancies in the paper cited below).
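To make the "bias-adjusted vote" concrete, here is a sketch of the posterior log-odds for one item under a two-coin annotator model. The notation is assumed for illustration and does not come from the thread, and the annotators are taken to be conditionally independent given the true label.

```latex
% Assumed notation: y_i^j in {0,1} is annotator j's label for item i,
% \alpha_j = P(y_i^j = 1 | y_i = 1) is annotator j's sensitivity,
% \beta_j  = P(y_i^j = 0 | y_i = 0) is annotator j's specificity,
% p = P(y_i = 1) is the prior probability that an item is positive.
\[
\log\frac{P(y_i = 1 \mid y_i^1,\dots,y_i^R)}{P(y_i = 0 \mid y_i^1,\dots,y_i^R)}
= \log\frac{p}{1-p}
+ \sum_{j=1}^{R}\left[\, y_i^j \log\frac{\alpha_j}{1-\beta_j}
+ \left(1 - y_i^j\right)\log\frac{1-\alpha_j}{\beta_j} \,\right]
\]
```

Each annotator adds one term to the sum, weighted by how informative its labels are; an annotator with \alpha_j = 1 - \beta_j contributes nothing. A trained classifier treated as one more annotator would add a term of the same form, which is the influence on the inferred truth that the comment above is concerned about.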

I have considered using the trained classifier as another annotator when doing active learning of the kind proposed in Sheng et al.'s paper on getting another label for an existing item vs. annotating a new item. In fact, there's no reason in principle why you can't have more than one classifier being trained along with annotator sensitivities and specificities.

I wrote some tech reports and a poster on a Bayesian generalization of these models, which (a) estimated the beta priors for annotator sensitivity and specificity, (b) introduced random effects (more latent items) for item difficulty, and (c) computed Bayesian posteriors using Gibbs sampling (with BUGS). I found the Bayesian model was more accurate than max likelihood as computed by EM (even with the beta prior estimates), and that max likelihood was more accurate than simple voting (against Snow et al.'s gold standards); the effects were greatest with small numbers of annotators per item.
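As a rough illustration of what a beta prior does to a sensitivity or specificity estimate, the sketch below compares the maximum-likelihood rate with the posterior mean under a fixed Beta prior. This is a simplification and not the model described above, which estimated the priors themselves and sampled with Gibbs; the function names and prior values are hypothetical.

```python
def mle_rate(successes: float, trials: float) -> float:
    """Maximum-likelihood estimate of a rate: observed successes over trials."""
    return successes / trials


def posterior_mean_rate(
    successes: float, trials: float, prior_a: float, prior_b: float
) -> float:
    """Posterior mean of a rate under a fixed Beta(prior_a, prior_b) prior.

    The prior is held fixed here (an assumption for illustration); a
    hierarchical model would estimate prior_a and prior_b from the data.
    """
    return (successes + prior_a) / (trials + prior_a + prior_b)


# An annotator who labelled 3 of 3 known positives correctly has a
# maximum-likelihood sensitivity of 3/3; the prior pulls the estimate
# back toward the prior mean prior_a / (prior_a + prior_b).
ml_estimate = mle_rate(3, 3)
shrunk_estimate = posterior_mean_rate(3, 3, prior_a=8.0, prior_b=2.0)
```

With few observations the prior dominates, and with many the two estimates converge.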

You can find my papers on the topic on the white papers section of our blog (including a 2-page poster I presented with the simplest model up to a tech report with more elaborate random effects models that account for item tagging difficulty):

http://lingpipe-blog.com/lingpipe-white-papers/

Unfortunately (for me), my 2009 NAACL submission on the topic was rejected, but it's available as a PDF:

http://lingpipe-blog.com/2009/01/23/lacks-a-convincing-comparison-with-simple-non-bayesian-approaches/

All the R and BUGS code I used to fit the models is available from the LingPipe sandbox, project hierAnno.

Vikas Raykar, 2009/06/22 12:58

The results do show that taking features into consideration helps us to get a better gold standard. This is intuitive since as long as a feature has some predictive power it can be considered as yet another noisy expert.

As an estimation problem, I am only trying to learn the classifier and the experts' sensitivity and specificity. So when learning the classifier, the features naturally come into play. In terms of the estimation problem, *the ground-truth is not a parameter*; the actual hidden ground-truth conveniently appears as missing data in the EM algorithm. As such, I am a bit perplexed about how to assess the performance of the estimate of the hidden variable in EM. Does the hidden variable in maximum likelihood enjoy the same properties as the other parameters?
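The role of the hidden ground truth as missing data can be sketched with a stripped-down EM loop. This is not the paper's algorithm: the classifier is left out and replaced by a single prevalence parameter, every annotator is assumed to label every item, and the pseudo-counts in the M-step are an assumption added to keep the rates away from 0 and 1. The name `em_binary_labels` is hypothetical.

```python
from typing import List, Tuple


def em_binary_labels(
    labels: List[List[int]], n_iter: int = 50
) -> Tuple[List[float], List[float], List[float]]:
    """labels[i][j] is annotator j's 0/1 label for item i.

    Returns (mu, sens, spec), where mu[i] is the posterior probability
    that item i is truly positive.
    """
    n_items = len(labels)
    n_annot = len(labels[0])
    # Initialise the soft labels with the fraction of positive votes per item.
    mu = [sum(row) / n_annot for row in labels]
    sens = [0.5] * n_annot
    spec = [0.5] * n_annot
    for _ in range(n_iter):
        # M-step: update the parameters from the current soft labels.
        # The +1 / +2 pseudo-counts are an assumption of this sketch.
        pos = sum(mu)
        neg = n_items - pos
        prev = (pos + 1.0) / (n_items + 2.0)
        sens = [
            (sum(m * row[j] for m, row in zip(mu, labels)) + 1.0) / (pos + 2.0)
            for j in range(n_annot)
        ]
        spec = [
            (sum((1.0 - m) * (1 - row[j]) for m, row in zip(mu, labels)) + 1.0)
            / (neg + 2.0)
            for j in range(n_annot)
        ]
        # E-step: posterior over each hidden true label given the parameters.
        new_mu = []
        for row in labels:
            a = prev        # joint weight of "true label is 1"
            b = 1.0 - prev  # joint weight of "true label is 0"
            for j, y in enumerate(row):
                a *= sens[j] if y == 1 else 1.0 - sens[j]
                b *= 1.0 - spec[j] if y == 1 else spec[j]
            new_mu.append(a / (a + b))
        mu = new_mu
    return mu, sens, spec
```

Only the prevalence, sensitivities, and specificities are updated as parameters in the M-step; the per-item values in `mu` come out of the E-step as posterior probabilities, and thresholding them (for example at 0.5) is one way to read off an estimated gold standard.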

I see that we have a lot of overlap in our work. It is unfortunate that your earlier NAACL submission was not accepted.

Vikas Raykar, 2009/06/22 13:09

After attending ICML I have been made aware of the following related work. I thought I should post it here:

1. Work by Bob Carpenter (see post above)

2. Pinar Donmez has some work on proactive learning (active learning with multiple imperfect oracles). See her KDD 2009 paper: Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling, Pinar Donmez, Jaime Carbonell, and Jeff Schneider.

3. S K Warfield, K H Zou, and W M Wells. 2004. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging 23 (7):903-21.

Brendan O'Connor, 2009/07/07 19:08

Hi, I read the paper and quite enjoyed it. Well done.
