Minimax Optimal Convergence Rates for Estimating Ground Truth from Crowdsourced Labels

  • Chao Gao
  • Denny Zhou

MSR-TR-2013-110

Most machine learning challenges stem, at root, from an insufficient amount of training data. In recent years, crowdsourcing has rapidly gained popularity as a way to collect labels for machine learning. With emerging crowdsourcing services, a large number of labels can be obtained at low cost from millions of crowdsourcing workers worldwide. However, the labels provided by these non-expert workers may not be of high quality. To address this issue, crowdsourcing requesters generally have each item labeled repeatedly by several different workers and then estimate the ground truth from the collected labels. In practice, the Dawid-Skene estimator has been widely used for this kind of estimation problem. The method was proposed more than thirty years ago, yet no theoretical result on its convergence rate has appeared in the literature. In this paper, we fill this gap by establishing minimax optimal convergence rates for the Dawid-Skene estimator. We obtain a lower bound that holds for all estimators and an upper bound for the Dawid-Skene estimator, and we show that the upper bound matches the lower bound. Thus, the Dawid-Skene estimator is minimax optimal. Moreover, we conduct a comparative study of the Dawid-Skene estimator and majority voting, highlighting the advantages and possible drawbacks of the Dawid-Skene estimator through rigorous analysis in various settings.
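To make the two estimators compared above concrete, the following is a minimal sketch of majority voting and a generic EM-style Dawid-Skene procedure with one confusion matrix per worker. It is an illustration only and not necessarily the exact variant analyzed in the report. The function names, the data layout (one `{worker_id: label}` dictionary per item), the vote-fraction initialization, the `smoothing` constant, and the fixed iteration count are all assumptions made for this example, and every item is assumed to have at least one label.

```python
import math


def majority_vote(labels, num_classes):
    """Pick the most frequent label per item (ties go to the smallest class index)."""
    estimates = []
    for item in labels:
        counts = [0] * num_classes
        for lab in item.values():
            counts[lab] += 1
        estimates.append(counts.index(max(counts)))
    return estimates


def dawid_skene(labels, num_classes, num_workers, iters=50, smoothing=0.01):
    """Generic EM sketch: alternate between worker confusion matrices and item posteriors.

    labels: list with one dict {worker_id: label} per item (assumed layout).
    Returns (estimated labels, confusion matrices indexed [worker][true][observed]).
    """
    n = len(labels)

    # Initialization (assumption): posterior over true labels = per-item vote fractions.
    post = []
    for item in labels:
        counts = [0.0] * num_classes
        for lab in item.values():
            counts[lab] += 1.0
        total = sum(counts)
        post.append([c / total for c in counts])

    conf = []
    for _ in range(iters):
        # M-step: class prior and per-worker confusion matrices from soft counts.
        prior = [sum(post[i][k] for i in range(n)) / n for k in range(num_classes)]
        conf = [[[smoothing] * num_classes for _ in range(num_classes)]
                for _ in range(num_workers)]
        for i, item in enumerate(labels):
            for w, lab in item.items():
                for k in range(num_classes):
                    conf[w][k][lab] += post[i][k]
        for w in range(num_workers):
            for k in range(num_classes):
                row_sum = sum(conf[w][k])
                conf[w][k] = [v / row_sum for v in conf[w][k]]

        # E-step: posterior over each item's true label, computed in log space.
        for i, item in enumerate(labels):
            logp = [math.log(max(prior[k], 1e-12)) for k in range(num_classes)]
            for w, lab in item.items():
                for k in range(num_classes):
                    logp[k] += math.log(conf[w][k][lab])
            m = max(logp)
            unnorm = [math.exp(v - m) for v in logp]
            z = sum(unnorm)
            post[i] = [u / z for u in unnorm]

    return [p.index(max(p)) for p in post], conf


if __name__ == "__main__":
    # Toy data (made up for illustration): 4 items, 3 workers, binary labels.
    toy = [{0: 0, 1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}, {0: 0, 1: 0, 2: 0}, {0: 1, 1: 1, 2: 1}]
    print("majority vote:", majority_vote(toy, 2))
    print("Dawid-Skene  :", dawid_skene(toy, 2, 3)[0])
```

Majority voting weights every worker equally, whereas the EM procedure weights each label by the estimated reliability of the worker who gave it. This difference is what makes a comparison of the two estimators across settings meaningful.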