Over the last few years, deep learning has revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. However, as the size of supervised artificial neural networks grows, typically so does the need for larger labeled datasets. Recently, crowdsourcing has established itself as an efficient and cost-effective solution for labeling large sets of data in a scalable manner, but it often requires aggregating labels from multiple noisy contributors with different levels of expertise. In this paper, we address the problem of learning deep neural networks from crowds. We begin by describing an EM algorithm for jointly learning the parameters of the network and the reliabilities of the annotators. Then, a novel general-purpose crowd layer is proposed, which allows us to train deep neural networks end-to-end, directly from the noisy labels of multiple annotators, using only backpropagation. We empirically show that the proposed approach is able to internally capture the reliability and biases of different annotators and achieve new state-of-the-art results for various crowdsourced datasets across different settings, namely classification, regression and sequence labeling.

Introduction

In the last decade, deep learning has made major advances in solving artificial intelligence problems in different domains such as speech recognition, visual object recognition, object detection and machine translation (Schmidhuber 2015). This success is often attributed to its ability to discover intricate structures in high-dimensional data (LeCun, Bengio, and Hinton 2015), thereby making it particularly well suited for tackling complex tasks that are often regarded as characteristic of humans, such as vision, speech and natural language understanding. However, a key requirement for learning deep representations of complex high-dimensional data is typically a large set of labeled data.
Unfortunately, in many situations this data is not readily available, and humans are required to manually label large collections of data. On the other hand, in recent years, crowdsourcing has established itself as a reliable solution to annotate large collections of data. Indeed, crowdsourcing platforms like Amazon Mechanical Turk (http://www.mturk.com) and Crowdflower (http://crowdflower.com) have proven to be an efficient and cost-effective way for obtaining labeled data (Snow et al. 2008; Buhrmester, Kwang, and Gosling 2011), especially for the kind of human-like tasks, such as vision, speech and natural language understanding, for which deep learning methods have been shown to excel. Even in fields like medical imaging, crowdsourcing is being used to collect the large sets of labeled data that modern data-savvy deep learning methods enjoy (Greenspan, van Ginneken, and Summers 2016; Albarqouni et al. 2016; Guan et al. 2017). However, while crowdsourcing is scalable enough to allow labeling datasets that would otherwise be impractical for a single annotator to handle, it is well known that the noise associated with the labels provided by the various annotators can compromise practical applications that make use of such type of data (Sheng, Provost, and Ipeirotis 2008; Donmez and Carbonell 2008). Thus, it is not surprising that a large body of the recent machine learning literature is dedicated to mitigating the effects of the noise and biases inherent to such heterogeneous sources of data (e.g. Yan et al. (2014); Albarqouni et al. (2016); Guan et al. (2017)). When learning deep neural networks from the labels of multiple annotators, typical approaches rely on some sort of label aggregation mechanism prior to training.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
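To make the simplest such aggregation mechanism concrete, the following is a minimal numpy sketch of majority voting over a matrix of crowd answers; the function name and the `-1` convention for unanswered items are illustrative choices, not taken from the paper:

```python
import numpy as np

def majority_vote(answers, missing=-1):
    """Aggregate crowd labels by majority vote.

    answers: (N, R) integer array with the labels given by R annotators
             to N items; `missing` marks items an annotator skipped.
    Assumes every item received at least one answer.
    Ties are broken in favor of the lowest class id.
    Returns an (N,) array of aggregated labels.
    """
    n_items, _ = answers.shape
    agg = np.empty(n_items, dtype=int)
    for i in range(n_items):
        votes = answers[i][answers[i] != missing]  # drop missing answers
        agg[i] = np.bincount(votes).argmax()       # most frequent label
    return agg
```

Aggregating labels this way implicitly treats every annotator as equally reliable, which is exactly the assumption the approaches discussed next try to relax.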
In classification settings, the simplest and most common approach is to use majority voting, which naively assumes that all annotators are equally reliable. More advanced approaches, such as the one proposed by Dawid and Skene (1979) and its variants (e.g. Ipeirotis, Provost, and Wang (2010); Whitehill et al. (2009)), jointly model the unknown biases of the annotators and their answers as noisy versions of some latent ground truth. Although these approaches improve on the ground truth estimates of majority voting, recent works have shown that jointly learning the classifier model and the annotators' noise model using EM-style algorithms generally leads to better results (Raykar et al. 2010; Albarqouni et al. 2016). In this paper, we begin by describing an EM algorithm for learning deep neural networks from crowds in multi-class classification settings, highlighting its limitations. Then, a novel crowd layer is proposed, which allows us to train neural networks end-to-end, directly from the noisy labels of multiple annotators, using only backpropagation. This alternative approach not only allows us to avoid the additional computational overhead of EM, but also leads to a general-purpose framework that generalizes trivially beyond classification.

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
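As a rough illustration of the crowd-layer idea, the sketch below computes a masked per-annotator loss on top of the shared network's class probabilities. It assumes one possible variant, not the paper's exact formulation: one trainable matrix per annotator maps the shared probabilities to annotator-specific logits, and the cross-entropy is summed only over observed answers, so gradients flow back through both the per-annotator matrices and the shared network:

```python
import numpy as np

def crowd_layer_loss(p, answers, W, missing=-1):
    """Sketch of a crowd-layer style loss (illustrative variant).

    p:       (N, C) class probabilities from the shared network.
    answers: (N, R) noisy labels from R annotators (`missing` = no answer).
    W:       (R, C, C) per-annotator trainable matrices.
    Returns the cross-entropy summed over all observed answers.
    """
    loss = 0.0
    n_items, n_annotators = answers.shape
    for r in range(n_annotators):
        # Annotator-specific predictions: linear map of the shared
        # probabilities, renormalized with a softmax.
        logits = p @ W[r].T                            # (N, C)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        for i in range(n_items):
            y = answers[i, r]
            if y != missing:                           # mask unanswered items
                loss -= np.log(probs[i, y] + 1e-12)
    return loss
```

With the per-annotator matrices initialized to the identity, an annotator whose answers agree with the shared predictions incurs a lower loss than one who systematically disagrees, which is how the layer can absorb annotator-specific biases during training.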