Training Machine Learning Models Using Noisy Data

Nathan Silberman

Troy: What's wrong with me?
Dr. Zaius: I think you're crazy.
Troy: I want a second opinion.
Dr. Zaius: You're also lazy.

The Simpsons: A Fish Called Selma

The concept of a second opinion in medicine is so common that most people take it for granted, especially given a severe diagnosis. Medicine is hard. Really hard. Disagreement between two doctors may be due to different levels of expertise, different levels of access to patient information or simply human error. Like all humans, even the world’s best doctors make mistakes.

At Butterfly, we’re building machine learning tools that will act as a second pair of eyes for a doctor and even automate the parts of their workflow that are laborious or error-prone. Our goal is to allow the doctor to spend less time in front of a screen and more time engaged with the patient.

Most of these tools take an image as input and perform some kind of interpretation. For example, does this image contain an aorta? Is this patient’s heart unhealthy? To enable such AI tools, we need to collect a massive dataset of images (the inputs) as well as the interpretation we want the machine to make (the outputs). This quickly introduces a problem. How do we train a model when doctors themselves don’t agree on what the machine should predict?

In this article, we present a real-world example of this problem along with a novel solution, developed by Butterfly intern Ryutaro Tanno, which is to be published and presented at the upcoming CVPR 2019 conference. We’ve included some of the math underlying this approach but have for the most part attempted to provide a conceptual overview in this post. For the complete mathematical underpinnings of the approach, see our CVPR paper.

Ultrasound View Classification

One of the most basic operations we’d like to perform is to automatically identify what part of the body is being imaged by the Butterfly iQ at any given time. Is the doctor imaging the kidney? The heart? The lungs?

Ultrasound images of the lungs, kidney, carotid artery and heart.

The problem of view identification is easily formulated as a supervised classification problem. We want to find the function that maps an image to one of K discrete labels. Formally:

$$y = f(x; \theta)$$

where $f$ is the function we want to learn, $x$ is the image, $\theta$ are the model parameters and $y$ is a label from the set $\{1, \dots, K\}$.

This standard classification formulation assumes, however, that there exists a single unambiguous ground truth.
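As a minimal sketch of this standard formulation, assume the model emits one raw score per view. The prediction is then the label with the largest softmax probability. The view names and scores below are made up for illustration and stand in for the output of a trained network.

```python
import math

VIEWS = ["lung", "kidney", "carotid", "heart"]  # hypothetical label set, K = 4


def softmax(scores):
    """Turn raw scores into a probability distribution over the K labels."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the most probable label and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return VIEWS[best], probs


label, probs = predict([0.2, 1.5, -0.3, 3.1])  # made-up scores for one image
```

Note that the function always returns exactly one label per image, which is precisely the assumption that breaks down when annotators disagree.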

At Butterfly, we work with a large number of expert clinicians to label data for machine learning. Using an internal annotation tool, clinicians are presented with an image or video and are asked to indicate which Ultrasound view (out of K choices) is shown. This has enabled us to collect a large dataset of images and labels

$$\{(x_n, \tilde{y}_n^{(r)})\}$$

where $n$ indexes the image and $\tilde{y}_n^{(r)}$ represents the corresponding label provided by annotator $r$. In practice, we often give the same image to multiple annotators and obtain multiple labels for the same image which disagree with one another. For example, one annotator might say the view is an Apical 3-Chamber view and another might call it an Apical 4-Chamber view. Formally, we observe:

$$\tilde{y}_n^{(r)} \neq \tilde{y}_n^{(r')} \quad \text{for some annotators } r \neq r'$$

How do we go about training a classification model when each image doesn’t have a single unambiguous ground truth?

Solution 1: The Wisdom of Crowds

In his book of the same name, writer James Surowiecki points out that the Wisdom of Crowds can be a very effective mechanism for predicting a value from noisy estimates. The earliest and most famous example of this phenomenon involves statistician Francis Galton. In 1906, he visited a fair where attendees were asked to guess the weight of an ox on display. He observed that the median of these (novice) estimates was closer to the truth than those produced by experts. This experiment has been replicated many times and the solution of “just take the mean, median or mode” is a common and often useful choice for machine learning practitioners when dealing with noisy classification labels.
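For classification labels, the "take the mode" strategy amounts to a majority vote. The sketch below is one simple way to do this, assuming each image comes with a list of labels from several annotators. The image ids and the abbreviations `A3C` and `A4C` (for the Apical 3-Chamber and 4-Chamber views) are illustrative, and returning `None` on a tie is a design choice made here, not something prescribed by the approach.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # nobody labeled this image
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # the top two labels tie
    return counts[0][0]


# Hypothetical annotations: image id -> labels from several annotators.
annotations = {
    "img_001": ["A4C", "A4C", "A3C"],
    "img_002": ["A3C", "A4C"],
}
consensus = {image: majority_vote(labels) for image, labels in annotations.items()}
```

The second image shows the weakness of voting with few labels: two annotators who disagree give no usable answer at all.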

Unfortunately, in the medical domain, the solution “get every image labeled multiple times” has an obvious flaw: it requires that you label every image multiple times! When labels are inexpensive, this may not be a problem. But in medicine, and in other fields where only experts know how to interpret an image, redundant image annotation can be extremely expensive. To avoid this costly solution, we need to do something smarter.

Solution 2: Modeling Annotator Confusion

Our approach assumes that there exists a ground truth which is, with some probability, mistakenly changed by each annotator. In other words, when annotating an image, the annotator mostly gets it right but with some probability makes a mistake. Formally, we assume that the annotators are statistically independent:

$$p(\tilde{y}^{(1)}, \dots, \tilde{y}^{(R)} \mid x) = \prod_{r=1}^{R} \sum_{y=1}^{K} p(\tilde{y}^{(r)} \mid y)\, p(y \mid x)$$

where $R$ is the number of annotators, $\tilde{y}^{(r)}$ represents the label provided by annotator $r$, $p(\tilde{y}^{(r)} \mid y)$ is the probability that annotator $r$ corrupted label $y$ and transformed it to $\tilde{y}^{(r)}$, and $p(y \mid x)$ models the probability that image $x$ has label $y$. The two terms on the right-hand side are modeled separately. The first is modeled via a confusion matrix and the second via a neural network.
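To make the independence assumption concrete, here is a small sketch that computes the probability of a set of observed noisy labels for one image. It assumes each annotator's corruption probabilities are stored as a matrix whose rows index the true label and whose columns index the label the annotator gives. The function names and all numbers are hypothetical.

```python
def annotator_label_probs(confusion, p_true):
    """p(noisy = j | x) = sum over y of confusion[y][j] * p_true[y]."""
    k = len(p_true)
    return [sum(confusion[y][j] * p_true[y] for y in range(k)) for j in range(k)]


def joint_label_prob(confusions, noisy_labels, p_true):
    """Probability of the observed labels, one per annotator, assuming
    the annotators corrupt the true label independently of each other."""
    prob = 1.0
    for confusion, label in zip(confusions, noisy_labels):
        prob *= annotator_label_probs(confusion, p_true)[label]
    return prob


p_true = [0.9, 0.1]  # made-up classifier output p(y | x) for K = 2
careful = [[0.95, 0.05], [0.05, 0.95]]  # row y, column j: p(noisy = j | true = y)
sloppy = [[0.6, 0.4], [0.4, 0.6]]

# The careful annotator says class 0, the sloppy one says class 1.
p = joint_label_prob([careful, sloppy], [0, 1], p_true)
```

Here the careful annotator contributes 0.95 * 0.9 + 0.05 * 0.1 = 0.86 and the sloppy one 0.4 * 0.9 + 0.6 * 0.1 = 0.42, so the joint probability is their product.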

Confusion Matrices

A confusion matrix models the frequency with which a sample of one class is mistaken for another.  

In the confusion matrix above, apples are always correctly predicted to be apples, jackfruit are correctly predicted 50% of the time and mistaken for durian the other 50% of the time, and durian are correctly labeled 60% of the time but mistaken for jackfruit 40% of the time.

Confusion matrices are convenient mechanisms for modeling the confusion that an agent (human or machine) exhibits between classes. They show which classes the agent predicts correctly and which it gets wrong. Keep this in mind as we demonstrate below how we use them to model the label corruption process.
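The fruit example can be written down directly as a small matrix. This sketch assumes rows are the true class and columns the predicted class, which is a layout chosen here for illustration. The helper name `label_prob` is hypothetical.

```python
CLASSES = ["apple", "jackfruit", "durian"]

# Rows are the true class, columns the predicted class (an assumed layout).
CONFUSION = [
    [1.0, 0.0, 0.0],  # apple: always predicted as apple
    [0.0, 0.5, 0.5],  # jackfruit: mistaken for durian half the time
    [0.0, 0.4, 0.6],  # durian: mistaken for jackfruit 40% of the time
]


def label_prob(true_class, predicted_class):
    """Probability that a sample of `true_class` is labeled `predicted_class`."""
    row = CONFUSION[CLASSES.index(true_class)]
    return row[CLASSES.index(predicted_class)]


# Every row is a probability distribution, so it must sum to 1.
rows_are_valid = all(abs(sum(row) - 1.0) < 1e-9 for row in CONFUSION)

# The diagonal holds the per-class accuracy of the agent.
per_class_accuracy = {name: CONFUSION[i][i] for i, name in enumerate(CLASSES)}
```

Reading the diagonal gives the agent's accuracy on each class, while the off-diagonal entries say where the mistakes go.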

Our Model

We model the problem of training from noisy labels as follows. First, a Convolutional Neural Network (CNN) classifier is fed an image $x$ and produces an estimated label probability distribution $p_\theta(x)$. This distribution is then corrupted by each annotator’s confusion matrix $A^{(r)}$ to produce that annotator’s probability distribution $p^{(r)}(x)$. During training, we train the model to produce probability distributions $p^{(r)}(x)$ which match the labels $\tilde{y}^{(r)}$ produced by the annotators.

The machine learning problem we solve, therefore, can be phrased as the problem of learning (a) neural network weights $\theta$ and (b) confusion matrix entries $A^{(1)}, \dots, A^{(R)}$ such that, when given an image $x$, our neural network produces a probability distribution over the true labels such that when the true labels are corrupted by each annotator’s corresponding confusion matrix, the corrupted label matches (as often as possible) the label actually provided by the annotator.
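The training objective described above can be sketched for a single image as a cross-entropy between each annotator's corrupted distribution and the label that annotator actually gave. This is a plain-Python illustration with hypothetical names and numbers. It only evaluates the loss. In practice the network weights and confusion matrices would be fit jointly by gradient descent in a deep learning framework, and the full objective in the CVPR paper may include terms not shown here.

```python
import math


def corrupt(confusion, p_true):
    """Annotator's distribution: p_r[j] = sum over y of confusion[y][j] * p_true[y]."""
    k = len(p_true)
    return [sum(confusion[y][j] * p_true[y] for y in range(k)) for j in range(k)]


def noisy_label_loss(p_true, confusions, observed):
    """Mean cross-entropy between each annotator's corrupted distribution
    and the label that annotator gave.

    `observed` maps annotator index -> noisy label, so an image may be
    labeled by any subset of annotators (even just one).
    """
    total = 0.0
    for r, label in observed.items():
        p_r = corrupt(confusions[r], p_true)
        total -= math.log(max(p_r[label], 1e-12))  # clamp to avoid log(0)
    return total / len(observed)


# Two hypothetical annotators: a fairly reliable one and a random guesser.
confusions = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.5, 0.5], [0.5, 0.5]],
]

# Annotator 0 labeled the image as class 0; compare two classifier outputs.
loss_confident_right = noisy_label_loss([0.99, 0.01], confusions, {0: 0})
loss_confident_wrong = noisy_label_loss([0.01, 0.99], confusions, {0: 0})
```

Because `observed` only contains the annotators who labeled a given image, the same loss works whether an image has one label or many.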

A very useful side-effect of this approach is that once trained, the confusion matrices themselves also give us an estimate of the annotator skill level! We’ll highlight this feature when we test on real-world data.
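One simple way to turn a learned confusion matrix into a single skill number is to average its diagonal, which is the mean probability that the annotator leaves a true label intact. This summary statistic is an assumption made for this sketch, not necessarily the measure used in the paper, and the annotator names and matrices below are made up.

```python
def skill_score(confusion):
    """Average of the diagonal: mean chance of keeping a true label intact."""
    k = len(confusion)
    return sum(confusion[i][i] for i in range(k)) / k


# Hypothetical learned matrices (rows: true class, columns: label given).
learned = {
    "sonographer_1": [[0.95, 0.05], [0.10, 0.90]],
    "engineer_1": [[0.70, 0.30], [0.45, 0.55]],
}

# Rank annotators from most to least reliable.
ranking = sorted(learned, key=lambda name: skill_score(learned[name]), reverse=True)
```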


We evaluated our method using two datasets. First, we used MNIST, a digit dataset and ubiquitous test bed in computer vision, to evaluate our model when we know with certainty the ground truth in both training and test. This regime allows us to stress-test our method before applying it to real data. Second, we used a real-world dataset of cardiac Ultrasound images where we observed a high level of annotator noise during training.

We began by generating a variant of MNIST where we artificially introduced several virtual annotators. Each virtual annotator has a different level of expertise and makes mistakes in different places. Is our model capable of recovering the truth in the presence of noise?
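A rough sketch of how such virtual annotators could be simulated: give each one a confusion matrix and sample a noisy label from the row of the true class. The uniform noise pattern, the accuracy values and the random stand-in labels are all assumptions chosen for simplicity. The annotators in our experiments make mistakes in different places rather than uniformly.

```python
import random


def make_confusion(num_classes, accuracy):
    """Confusion matrix that keeps the true label with probability `accuracy`
    and spreads the remaining mass evenly over the other classes."""
    off = (1.0 - accuracy) / (num_classes - 1)
    return [[accuracy if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]


def annotate(true_label, confusion, rng):
    """Sample one noisy label from the confusion-matrix row of `true_label`."""
    row = confusion[true_label]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
annotators = {"expert": make_confusion(10, 0.95), "novice": make_confusion(10, 0.60)}
true_labels = [rng.randrange(10) for _ in range(2000)]  # stand-in for MNIST digits

noisy = {name: [annotate(y, cm, rng) for y in true_labels]
         for name, cm in annotators.items()}

# Fraction of labels each virtual annotator got right.
agreement = {name: sum(a == y for a, y in zip(labels, true_labels)) / len(true_labels)
             for name, labels in noisy.items()}
```

Because the true labels are known, the agreement rates can be checked against the accuracy each virtual annotator was configured with.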

As illustrated in the chart above, we compared against several state-of-the-art methods in two regimes. In the first regime, every image was noisily labeled by every annotator. In the second regime, every image was noisily labeled by only a single annotator. Our results indicate that our model is not only less sensitive to noise than prior art, but also achieves better accuracy. More importantly, our model operates well even when we have a single label per image!

While accuracy and the ability to avoid the acquisition of multiple labels for each image are critical, we also evaluated how fast our neural network converges. Results for this experiment, shown above, illustrate that our model also converges very quickly.

Finally, we evaluated our model on a dataset of real-world cardiac Ultrasound data. For our training data, 6 credentialled sonographers labeled 24,000 images. For the test data, 3 more experienced sonographers labeled 22,000 images, and the test set was limited to those images on which they reached consensus. We also included annotations from two non-experts (read: engineers!) to help validate our model’s sensitivity.
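Restricting a test set to consensus images can be sketched as a simple filter. This example assumes that consensus means unanimous agreement among the annotators, which is an interpretation made here. The image ids and labels are hypothetical.

```python
def consensus_subset(labels_by_image):
    """Keep only images whose annotators all gave the same label."""
    kept = {}
    for image, labels in labels_by_image.items():
        if labels and len(set(labels)) == 1:
            kept[image] = labels[0]
    return kept


# Hypothetical test annotations from three sonographers.
test_annotations = {
    "img_a": ["A4C", "A4C", "A4C"],
    "img_b": ["A4C", "A3C", "A4C"],
    "img_c": ["A3C", "A3C", "A3C"],
}
test_set = consensus_subset(test_annotations)
```

Images with any disagreement, such as `img_b`, are dropped, so the remaining labels can be treated as ground truth for evaluation.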

As the graphs above illustrate, our model not only outperformed the prior art in terms of accuracy (d), it also closely recovered the actual confusion matrices (c) and was able to distinguish the skill levels of the expert and novice annotators.

Summary And Future Work

We’ve introduced a new neural network architecture that allows us to train machine learning classifiers in the presence of highly noisy annotations. This architecture not only performs well in terms of accuracy, it also trains quickly, allows us to avoid the costly strategy of labeling every image by every annotator and even models annotator skill levels.

This result was very exciting for us as it has allowed us to change the way we annotate data as well as determine, automatically, which annotators were performing better than others.

That being said, while our model is able to deal with noisy training data, the evaluation regimes above both assumed we had access to unambiguous ground truth during evaluation. This assumption is ubiquitous in research in this area. However, there are many problems we work on where unambiguous ground truth is unavailable at evaluation time as well. How do we deal with this? We’ll be discussing this in a future blog post!