
> AI helps provide good analysis without having to clean up data manually.

My own experience has shown that dirty data impacts advanced AI just as much as it impacts far more basic ML techniques.

Even for the most advanced AI we work on, we spend just as much time worrying about clean data as we do anything else.



When you say "clean data", what exactly do you mean? I've often seen this claim that cleaning data takes a lot of time, but it seems like an ill-defined term.


It can mean different things.

In general: duplicate data, missing fields, different formats for different parts of the data, inconsistent naming schemes

For text: character encodings, special symbols, escape characters, punctuation, extra or missing spaces and newlines, capitalization

For images: different sizes, rotations, crops, blurry images

For numbers: inconsistent decimal point/comma, outliers with obviously nonsense values or zeros, values in different units of measurement etc.
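To make a few of these concrete, here is a minimal sketch of text normalization, tolerant number parsing, and de-duplication using only the Python standard library. The function names (`clean_text`, `parse_number`, `dedupe`) are made up for illustration, and the separator rule in `parse_number` is an assumption that will be wrong for some data (for example, it reads "1,234" as 1.234), which is exactly why this kind of cleaning eats so much time.

```python
import re
import unicodedata
from typing import Optional


def clean_text(s: str) -> str:
    """Normalize encoding variants, whitespace, and capitalization."""
    # NFKC folds compatibility characters (non-breaking spaces,
    # full-width letters, ligatures) into their plain equivalents.
    s = unicodedata.normalize("NFKC", s)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    s = re.sub(r"\s+", " ", s).strip()
    # casefold() is a more aggressive lower() meant for comparisons.
    return s.casefold()


def parse_number(s: str) -> Optional[float]:
    """Parse a number that may use either '.' or ',' as the decimal mark."""
    s = s.strip().replace(" ", "")
    if not s:
        return None
    if "," in s and "." in s:
        # Assumption: whichever separator appears last is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # Assumption: a lone comma is a decimal comma, not a thousands mark.
        s = s.replace(",", ".")
    try:
        return float(s)
    except ValueError:
        return None  # leave unparseable values for a human to look at


def dedupe(records, key_fields):
    """Keep the first record for each normalized key."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(clean_text(str(record.get(f, ""))) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For example, `dedupe([{"name": "Acme  Corp"}, {"name": "acme corp"}], ["name"])` keeps only the first record, because both names normalize to the same key.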


For user behavior: bots, clickfraud/clickjacking, bored teenagers, competitors who are sussing out your product, people who got confused by your user interface, users who have JavaScript disabled and so never trigger your clicktracking, users on really old browsers that don't have JavaScript to begin with.

And then there are bugs in your data pipeline: browser bugs (particularly IE), logging bugs, bugs from not understanding your distributed database's conflict resolution policy, failed attempts at cleaning all the previous categories, incorrect assumptions about the "shape" of your data, self-DoS attacks that generate extra duplicate requests (no joke: Google almost brought itself down with an img that had an empty src attribute, which forces the browser to make a duplicate request on every page), incorrectly filtering requests so you count /favicon.ico as a pageview, etc.
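As a rough illustration of the request-filtering point, here is a sketch of a pageview counter that skips non-page paths and self-identifying bots. The names `NON_PAGE_PATHS`, `BOT_MARKERS`, `is_pageview`, and `count_pageviews` are hypothetical, and the marker list is a placeholder rather than a real bot database. It only catches bots that announce themselves in the User-Agent, so clickfraud and the other behavioral cases sail right through.

```python
from urllib.parse import urlsplit

# Hypothetical lists for illustration; real ones are longer and change often.
NON_PAGE_PATHS = {"/favicon.ico", "/robots.txt"}
BOT_MARKERS = ("bot", "crawler", "spider")


def is_pageview(url: str, user_agent: str) -> bool:
    """Return True if a logged request should count as a pageview."""
    # Compare on the path alone so "/favicon.ico?v=2" is still filtered.
    path = urlsplit(url).path
    if path in NON_PAGE_PATHS:
        return False
    ua = user_agent.lower()
    return not any(marker in ua for marker in BOT_MARKERS)


def count_pageviews(requests) -> int:
    """Count pageviews in an iterable of (url, user_agent) pairs."""
    return sum(1 for url, ua in requests if is_pageview(url, ua))
```

Note that this does nothing about the duplicate-request problem: two identical hits from the same browser both count, and deciding whether they are a reload or a bug needs timestamps and session data that a filter like this never sees.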



