
However, nothing is said about how the errors are detected. Can an ML expert chime in?


I'm a Product Manager at Deepomatic and I have been leading the study in question here. To detect the errors, we trained a model (with a different neural network architecture from the 6 listed in the post), and we then have a matching algorithm that highlights all bounding boxes that were either annotated but not predicted (False Negative) or predicted but not annotated (False Positive). Those potential errors are also sorted by an error score so that the most obvious errors come first. Happy to answer any other question you may have!
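For a rough idea of what that matching step looks like, here is a minimal sketch in Python (not our actual code; the 0.5 IoU threshold and the way the error score is computed are simplifications):

    def iou(a, b):
        # boxes as (x1, y1, x2, y2)
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def potential_errors(annotations, predictions, iou_thr=0.5):
        candidates = []
        # predicted but not annotated -> potential False Positive
        for p in predictions:
            if all(iou(p["box"], a["box"]) < iou_thr for a in annotations):
                candidates.append(("FP", p.get("confidence", 1.0), p))
        # annotated but not predicted -> potential False Negative
        for a in annotations:
            overlaps = [iou(a["box"], p["box"]) for p in predictions] or [0.0]
            if max(overlaps) < iou_thr:
                # less overlap with any prediction -> more obvious candidate
                candidates.append(("FN", 1.0 - max(overlaps), a))
        # sort so the most obvious candidates come first
        return sorted(candidates, key=lambda c: c[1], reverse=True)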


Were the corrected datasets larger or smaller than the originals?

It would also be interesting to see these improved datasets run through crash simulations alongside the existing datasets to see how they handle. Though I'm not sure how you would go about that beyond approaching current providers of such cars for data to work through, and I suspect they may be less open to admitting flaws, which may be a stumbling block.

Certainly makes you wonder how far we can optimise such datasets to get better results. I know some ML datasets are a case of humans fine-tuning, going through examples and classifying them, and I wonder how much that skews or affects error rates, as we all know humans err.


To answer your first question, we had bounding boxes both added and removed, and the main type of error differed depending on the dataset (I'd say it was overall more often objects that were forgotten, especially small objects).

It would indeed be very interesting to see the impact of those improved datasets on driving, which is ultimately the task being automated for cars. We've been working on many projects at Deepomatic, not only related to autonomous cars, and we did see concrete impact from cleaning the datasets beyond the performance metrics.


So in the article you write that you found 20% errors in the data, but at what point do you conclude "this is an error in the data" rather than "this is an error in the prediction"?

Is that done manually?

Also, do you have a strategy for finding errors where the model learned to mislabel items in order to increase its score? (E.g., red trucks are labeled as red cars in both train and test.)


There was indeed a manual review of the "potential errors" highlighted by our algorithm to determine if each was indeed an error in the data or an error in the prediction. The 20% corresponds to the proportion of objects that were corrected in this manual review. So it's actually likely that some errors (ones not found by our algorithm) are still in our clean version of the dataset.


Curious if you could find errors by comparing the results from the different models. Places where models disagree with each other more often would be areas that I would want to target for error checking.


> Places where models disagree with each other more often would be areas that I would want to target for error checking.

This is a great idea if your goal is to maximize the rate at which things you look at turn out to be errors. (On at least one side.)

But it's guaranteed to miss cases where every model makes the same inexplicable-to-the-human-eye mistake, and those cases would appear to be especially interesting.


This is a good idea, and there are actually 2 objectives when one wants to clean a dataset:

- you might want to optimize your time and correct as many errors as you can, as fast as you can. Using several models will help you in that case, and that's actually what we've been focusing on so far.

- you might want to find the most ambiguous cases, where you really need to improve your models, as those edge cases are the ones causing the problems you have in production.

Those 2 objectives are quite opposite. In the first case, you want to find the "easiest" errors, while in the second, you want to focus on edge cases, so you probably need to look at errors with intermediate scores, where nothing is really certain (see the toy sketch below).
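A toy illustration of how the two objectives lead to two different orderings of the same scored candidates (the 0.5 pivot is just illustrative, not the value we use):

    def order_for_quick_cleaning(candidates):
        # objective 1: obvious errors first -> highest error score first
        return sorted(candidates, key=lambda c: c["score"], reverse=True)

    def order_for_edge_cases(candidates):
        # objective 2: ambiguous cases -> error scores closest to 0.5 first
        return sorted(candidates, key=lambda c: abs(c["score"] - 0.5))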


You do that with human annotators.

“Annotator agreement” is a measure of confidence in the correctness of labels, and you should always keep an eye on how it is handled when reading papers that present a dataset.
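For example, here's a minimal sketch of pairwise agreement (Cohen's kappa) on per-item class labels; for a detection dataset you would need a box-matching step on top of this:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # labels from two annotators on the same items, in the same order
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)  # 1 = perfect agreement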

Saying we should start doing model agreement is a really good idea imho.


Thank you for the explanation.


i.e. you get someone to check where the model and the annotations disagree.


My guess would be some sort of active learning. In other words: 1) build a model using the dataset, 2) make predictions on the training data, 3) find the cases where the model is most confused (the difference in probability between classes is low), 4) raise those cases to humans (rough sketch below).

https://en.wikipedia.org/wiki/Active_learning_(machine_learn...
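Something along those lines, as a generic margin-sampling sketch (not necessarily what they actually did):

    import numpy as np

    def most_confused(probs, k=100):
        # probs: (n_samples, n_classes) predicted class probabilities
        sorted_p = np.sort(probs, axis=1)
        margin = sorted_p[:, -1] - sorted_p[:, -2]  # top-1 minus top-2 probability
        return np.argsort(margin)[:k]               # smallest margin = most confused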


Plus we'll have to register simply to see a few examples of mislabeling... that was disappointing.


I've added screenshots of errors in the blogpost so that you have an idea of the errors we spotted. Let me know what you think of them.


A couple notes on those screenshots:

- In the cars-on-the-bridge image, the red bounding box for the semitruck in the oncoming lanes is too small, with its upper bound just above the top of the semi's windshield, ignoring the much taller roof and towed container.

- In the same image, there are red bounding boxes around cars that exist, and also red bounding boxes around non-cars that don't exist. If false positives and false negatives are going to be represented in the same picture, it'd be nice to use different colors for them, so the viewer can tell whether the error was identified correctly or spuriously.

- I have trouble understanding the "bus" screenshot. The caption says "(green pictures are valid errors) – The pink dotted boxes are objects that have not been labelled but that our error spotting algorithm highlighted." In other words, the green-highlighted pictures are false negatives considered from the perspective of the original data set, and the red-highlighted pictures are true negatives. Or alternatively, the green-highlighted pictures are true positives from the perspective of the error-spotting algorithm, and the red-highlighted pictures are false positives. What confuses me is that all 9 pictures are labeled "false positive" by the tabbing at the top of the screenshot.



