> AI helps provide good analysis without having to clean up data manually. My...

throw_away_777 · on Dec 3, 2016

When you say "clean data", what exactly do you mean? I've often seen this claim that cleaning data takes a lot of time, but it seems like an ill-defined term.

bonoboTP · on Dec 3, 2016

It can mean different things.

In general: duplicate data, missing fields, different formats for different parts of the data, inconsistent naming schemes

For text: character encodings, special symbols, escape characters, punctuation, extra or missing spaces and newlines, capitalization

For images: different sizes, rotations, crops, blurry images

For numbers: inconsistent decimal point/comma, outliers with obviously nonsense values or zeros, values in different units of measurement etc.
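A minimal sketch of a few of these fixes (text normalization, mixed decimal point/comma, and duplicates that differ only in spacing or case), using only the Python standard library. The function names are made up for illustration, and the "last separator is the decimal separator" rule is an assumption that will not hold for every dataset:

```python
import unicodedata


def normalize_text(raw: str) -> str:
    """Unify Unicode forms, collapse runs of whitespace, and fold case."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split()).casefold()


def parse_number(raw: str) -> float:
    """Parse a number written with either a decimal point or a decimal comma.

    Assumption: the last '.' or ',' is the decimal separator and any earlier
    ones are thousands separators. This is ambiguous for input like "1,234",
    which is read here as 1.234.
    """
    s = raw.strip().replace(" ", "")
    last = max(s.rfind(","), s.rfind("."))
    if last == -1:
        return float(s)
    integer_part = s[:last].replace(",", "").replace(".", "")
    return float(f"{integer_part}.{s[last + 1:]}")


def dedupe(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, comparing normalized text."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_text(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Outliers, unit mismatches, and image problems usually need domain knowledge rather than a generic helper, so they are left out of the sketch.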

nostrademons · on Dec 3, 2016

For user behavior: bots, click fraud/clickjacking, bored teenagers, competitors who are sussing out your product, people who got confused by your user interface, users who have JavaScript disabled and so never trigger your click tracking, users on really old browsers that don't have JavaScript to begin with.

And then there are bugs in your data pipeline: browser (particularly IE) bugs, logging bugs, "didn't understand your distributed database's conflict resolution policy" bugs, failed attempts at cleaning all the previous categories, incorrect assumptions about the "shape" of your data, self-DoS attacks that result in extra duplicate requests (no joke: Google almost brought itself down by having an img tag with an empty src attribute, which forces the browser to make a duplicate request on every page), incorrectly filtering requests so you count /favicon.ico as a pageview, etc.
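As a rough illustration of the filtering side, here is a sketch that counts pageviews while skipping /favicon.ico, obvious bots, and duplicate requests. The record fields (`url`, `user_agent`, `request_id`), the bot markers, and the function names are all hypothetical; real bot detection needs far more than a substring check on the user agent:

```python
from urllib.parse import urlsplit

# Hypothetical markers and paths; a real pipeline would maintain longer lists.
BOT_MARKERS = ("bot", "crawler", "spider")
NON_PAGE_PATHS = {"/favicon.ico", "/robots.txt"}


def is_pageview(request: dict) -> bool:
    """Return True if the logged request looks like a real page load."""
    path = urlsplit(request.get("url", "")).path
    if path in NON_PAGE_PATHS:
        return False
    agent = request.get("user_agent", "").lower()
    return not any(marker in agent for marker in BOT_MARKERS)


def count_pageviews(requests: list[dict]) -> int:
    """Count pageviews, ignoring repeats of the same request_id."""
    seen_ids = set()
    count = 0
    for request in requests:
        if not is_pageview(request):
            continue
        request_id = request.get("request_id")
        if request_id is not None:
            if request_id in seen_ids:
                continue  # duplicate of a request already counted
            seen_ids.add(request_id)
        count += 1
    return count
```

None of this helps with users who never trigger the tracking in the first place; it only removes noise from what was actually logged.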