Math versus Dirty Data (jeremykun.com)
103 points by furcyd on July 25, 2019 | hide | past | favorite | 16 comments



Is it the new normal to vent about your job in a public blog post, saying it's not very exciting, that it mostly involves "poop-smithing," and that you'd be thinking of quitting if you didn't have other stuff going on, and for the company to be totally OK with that?


A long time ago, HN used to be a place where people had a general distrust of large corporations, and it was much more impressive to be working for an ambitious startup than to be highly paid at a large company just grinding out the status quo.

I'm really happy to see posts like this here. Jeremy has done lots of great work, so it's somewhat refreshing to see that even someone like him struggles at times at a place like Google. It's also important because it shows that FAANG isn't all it's cracked up to be, and that it is a completely legitimate career path to remain in small startups, maybe getting paid less, but having more fun along the way.

I've worked at both the big and small and this post definitely resonates with me and certainly makes it easier to go into work knowing that I'm not the crazy one.


> It's also important because it shows that FAANG isn't all it's cracked up to be, and that it is a completely legitimate career path to remain in small startups

I don't think those are the only two options. There's the other 90% in between.


Tbh it’s probably healthy. I think lots of people who have a good handle on the abstract world carry a constant sense of what could be, if only X were different about the real world.

It’s an infuriating internal battle that lots of others can relate to, so it makes for pleasurable reading.

For companies to resent it... I think you may have a point. It’s hardly great PR and actually gives you a good reason to discount whatever the marketing says.


It's pretty common, and it seems to be getting more popular with sites like Blind and Fishbowl.


It's funny how Google is perceived in the public as an all-knowing organism, gobbling up petabytes of data public and private to fuel a superior intelligence... Actually it can't even get clean data about its own internal operation...


Do you believe the two scenarios are mutually exclusive?

I've definitely seen people be extremely disorganised in some aspects of life but moderately or even fairly successful in others.

I suspect companies are also capable of exhibiting similar behaviours.


Well, if you collect data on a massive scale like that, assuming it would be clean is the grossly unreasonable part.


I dealt with this issue in my PhD research on data mining, and it motivated me to search for a new line of work. The benefits of data-based products never justify customers changing everything just to make life easier for the data nerd in the corner (in my case, at least). What made them happiest was when I assembled some basic tools so they could mine their data on their own - "teaching a man to fish, ruining a business opportunity".

I think dirty data are like the refs in a football game. Nobody comes to see them, but they'll be part of the game until you have perfect players.


In most organizations I have interacted with, getting data with a known sampling regime is always a problem. “The temperature sensor sends a measurement every 5 minutes.” Except, I learned after a lot of debugging, when the air-conditioner gets maintained and the readings become a fixed value.

It’s not the organization’s fault. Discovering and maintaining a known sampling regime for the data is part of the process. I consider it a win when I can get a team to accept that the data process will need to be debugged just like code.
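That kind of "stuck sensor" failure is easy to catch once you treat it as a debuggable data process. A minimal sketch (all names here are hypothetical, not from the comment above) flags runs of identical readings that are too long to be a real plateau:

```python
from itertools import groupby

def flag_stuck_runs(readings, min_run=3):
    """Flag index ranges where a sensor reports the exact same value
    min_run or more times in a row -- a common symptom of a sensor
    frozen during maintenance rather than a genuine steady state."""
    flagged = []
    i = 0
    for value, group in groupby(readings):
        run = len(list(group))
        if run >= min_run:
            flagged.append((i, i + run - 1, value))
        i += run
    return flagged

# Readings every 5 minutes; the 21.0 run is the AC being serviced.
readings = [19.8, 20.1, 21.0, 21.0, 21.0, 21.0, 20.4]
print(flag_stuck_runs(readings))  # [(2, 5, 21.0)]
```

In practice you'd tune `min_run` to the sensor's noise floor, since a truly constant reading from a noisy sensor is itself suspicious.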


I just came back from an oil and gas conference, and this dog running after missing/wrong data is painfully accurate. Datasets are often partially or completely suspect. Two geologists might have contradictory opinions regarding the interpretation of logging data. Even when you have good data, it might not be useful six feet into a formation.


> I don’t have data-intensive applications or problems of scale, but rather policy-intensive applications.

As I understand it, this "policy-intensive" problem means a high rate of changing requests from the customers of the system. In other words, customers don't fully know what they need and produce a stream of requests. This stream of requests may converge on some stable "global" requirements, or, alternatively, may remain a moving target (never converge).

Additionally, some of those requests are caused by different (supposedly better) understanding of the nature of data - the object of the system. That is, with evolution of the system customers (with the help of developers) understand more and more specific details about the data - missing parts, ambiguous parts, alternative sets of attributes etc.

The most promising approach for such problems, so far, is to organize the system as a set of independent operations that are as composable as possible. When another request comes from a customer, or another detail about the data becomes known, a system built from such composable components better supports incremental modification to handle the change.

A good set of operations sometimes develops over time. This approach requires constant reflection on what a particular change means for the existing process, uncovering assumptions and making them explicit and changeable...
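The composable-operations idea above can be sketched in a few lines. This is one hedged interpretation, not the commenter's actual system: each operation is a plain record-to-record function, so a new customer request usually means adding or swapping one step rather than rewriting the pipeline.

```python
from functools import reduce

def compose(*steps):
    """Chain independent record-level operations into one pipeline.
    Each step is a plain function record -> record."""
    return lambda record: reduce(lambda r, step: step(r), steps, record)

# Hypothetical steps over an illustrative record format.
def strip_whitespace(r):
    return {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}

def default_missing_country(r):
    return {**r, "country": r.get("country") or "unknown"}

pipeline = compose(strip_whitespace, default_missing_country)
print(pipeline({"name": " Ada ", "country": None}))
# {'name': 'Ada', 'country': 'unknown'}
```

When a newly discovered data quirk arrives, it becomes one more small function in the `compose` call, which keeps each assumption explicit and individually changeable.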


I now want to start a math cafe. Chalkboards. Coffee. Hang pieces from local generative artists. I'll stop short of making puns about pie. But it'd be nice. I already know what to name it! Satz. (German for theorem / coffee residue).

"A mathematician is a machine for turning coffee into theorems."


If anyone is interested, we are working on a solution to the dirty data problem at http://treenotation.org.


I'm sorry to sound pessimistic, but I don't think that there's any technical solution to dirty data.

The problem is usually not "this data was sent as JSON without any schema and with syntax errors"; it's "this Avro file has a completely useless schema (e.g. everything is typed as string|null), and there are multiple enumerations where the same value is encoded as 3 different strings (e.g. yes, y, true)".


If we make it 10x+ easier to define schemas and reuse existing ones (by making authoring, concatenating, extending, sharing, and discovering such schemas easier), provide great translation mechanisms to/from existing formats like SQL, JSON, CSV, and XML, and provide automatic schema generation and data fixing from better deep learning models... I think we can get there.

We are working on all of that.



