Every few years an article like this pops up. I find it tiring - they're written primarily from the viewpoint of a software engineer who is probably trying to write a parser and needs to handle the edge cases. As a data scientist, I receive and process around 75GB of CSV every day - of course I don't process it manually. Our processes have been running for a few years now, and millions of dollars of revenue ride on them. I don't see any need to retire anything, and I'm pretty sure our processes will still be working and running 10 years from now. The only things that will change are some statistical models which make use of this data.
Yeah - I used to lead a department that would process somewhere around 10TB of CSV formatted data per day.
The edge cases are a hassle, but they don't become less of a hassle from a business perspective by switching to JSON or really any other format. We tried an experiment of using more JSON and eventually gave it up because it wasn't saving any time at a holistic level: the "data schema" conversations massively dominated the entirety of the development and testing time.
Obviously being able to jam out some JSON helped quite a bit initially, but then on the QA side we started to run into problems with tooling not really being designed to handle massive JSON files. Basically, when something was invalid (such as the first time we encountered an invalid quote), it was not enjoyable to figure out where that was in a 15GB file.
That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really, really sucks. But to solve that, we would output to a columnar storage format like Parquet or something. This would let us fully encode and manage the data how we wanted while letting our clients keep working with their existing processes.
What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all.
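Parquet does get part of the way there, since the schema and per-column min/max/null-count statistics live in the file footer and can be read without scanning the rows. A rough sketch with pyarrow, assuming a hypothetical data.parquet with an "amount" column, just to show what can be checked from metadata alone:

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")  # hypothetical file

    # The declared type comes straight from the footer; no row data is read.
    amount_type = pf.schema_arrow.field("amount").type
    print("declared type:", amount_type, "integer:", pa.types.is_integer(amount_type))

    # Per-row-group min/max and null counts also live in the footer
    # (flat schema assumed, so the field index matches the column index).
    col = pf.schema_arrow.get_field_index("amount")
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col).statistics
        if stats is not None and stats.has_min_max:
            print(f"row group {rg}: min={stats.min} max={stats.max} nulls={stats.null_count}")

And because the column is physically typed, an int64 column can't quietly contain strings the way a CSV column can.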
Really appreciate the insight from you and the GP here. I have been struggling with data format decisions around a personal project that will only be used by a few people, being unsure about the extent to which I should try to make it bulletproof (but harder to maintain and modify) or just keep it simple (but primitive). It's helpful to see an experienced professional perspective showing that you can fall into a tooling rabbit hole at any scale.
> "data schema" conversations massively dominated the entirety of the development and testing time.
Agreed. JSON lets me know something is a number. That's great, but I still have to check for min/max, zero, etc. A string? That's great, but I've got to check it against a set of enums, and so forth. Basically, the "types" JSON gives you are about 20% of the work, and you're going to have to parse things into your own types anyway.
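Roughly, the other 80% ends up looking something like a JSON Schema (or a hand-rolled equivalent) with the real constraints spelled out. A minimal sketch using the jsonschema package, with made-up field names and limits:

    import json
    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {
            "quantity": {"type": "integer", "minimum": 0, "maximum": 10000},
            "status": {"type": "string", "enum": ["new", "shipped", "cancelled"]},
        },
        "required": ["quantity", "status"],
    }

    record = json.loads('{"quantity": -3, "status": "shipped"}')
    try:
        validate(instance=record, schema=schema)
    except ValidationError as err:
        print("rejected:", err.message)  # "-3 is less than the minimum of 0"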
> What I would really like to see is a file format where the validity of the file could be established by only using the header.
Are you saying something like a checksum, so that not only is a schema provided but also some method to verify that the data obeys the schema?
If you're talking about just some stronger shared ontology, I think that's a direction things will go. I call this concept "Type the world" or "World Wide Types". I'm starting to think something like GPT-N will be the primary author, rather than a committee of humans like Schema.org.
Honestly with the schema thing I'd probably be fine with either/or!
A checksum would be crude and user-hostile, only able to say "you did it wrong" but not really able to tell you what it means to do it right.
If I understand the concepts correctly then it seems like a shared ontology could potentially solve the problem in a non-hostile way.
Plus, it makes me happy because I feel like types are a real-world problem, so it is always nice if the type system could enforce that real-world-ness and all the messiness that comes along for the ride.
We looked at it, but there were a few problems with where it would force us to put VMs, which we just weren't super comfortable with due to the in-process-ness.
More a byproduct of decisions made 5 - 7 years ago when the company was in raw startup mode versus a more mature roadmap.
That's pretty much how I feel as well. People who complain about all the variations in CSV "formats" out there aren't wrong, but in practice, for me anyway, most of the files I work with in any given situation come from the same source or a small set of sources and use the same variant.
If/when I get a data source introduced into my workflow that differs from this variant I come up with a routine to normalize it, integrate that into my workflow and move on.
Pretty much the opposite here. I'm dealing with lots of different formats from different sources, often one-off. There are all kinds of crazy things people will call "data" (CSV 'variants' are the tip of the iceberg).
So you can constrain what type of CSV you will allow, and if something outside that shows up, it will bail. It's that simple. There is nothing wrong with having additional constraints on top of just saying it must be "CSV", especially in these scenarios.
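To illustrate the "bail" part, something like the following is usually enough: pin the header, the delimiter, and the field count, and refuse anything else. The column names here are hypothetical.

    import csv

    EXPECTED_HEADER = ["order_id", "sku", "quantity", "price"]

    def read_constrained_csv(path):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter=",", quotechar='"', strict=True)
            header = next(reader)
            if header != EXPECTED_HEADER:
                raise ValueError(f"unexpected header: {header!r}")
            for row_no, row in enumerate(reader, start=2):
                if len(row) != len(EXPECTED_HEADER):
                    raise ValueError(
                        f"row {row_no}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}"
                    )
                yield dict(zip(EXPECTED_HEADER, row))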
I'm in a similar situation, we've been using CSV for over a decade to move billions of dollars worth of product each year. It just works.
> I'm pretty sure most devs are going to use whatever CSV library that comes with their language. When that breaks, it's generally not a simple fix.
Call me a yak-shaver, but in every language I've worked with I've written my own csv parsing library when I needed one.
It's such a trivial thing for the majority of cases (varying delimiters, line endings, ASCII/UTF-8, quoting, escaping, and embedding of delimiters/line-endings) that it takes barely any time at all after you've done it once in another language. Of course there are edge cases and special cases depending upon specific workloads, but if your team has its own parser (which is a small amount of obvious code), then it does indeed usually become a simple fix.
Using someone else's library sounds good, but below a certain complexity it's rarely worth it in the medium to long term, except when doing proof-of-concept or demo code, or if the domain space is complex.
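For a sense of scale, here's roughly what such a hand-rolled parser looks like: a small state machine that handles quoted fields, doubled quotes as escapes, and delimiters or newlines embedded inside quotes. Deliberately minimal, not a drop-in replacement for a real CSV library.

    def parse_csv(text, delimiter=","):
        rows, row, field = [], [], []
        i, in_quotes = 0, False
        while i < len(text):
            c = text[i]
            if in_quotes:
                if c == '"':
                    if i + 1 < len(text) and text[i + 1] == '"':
                        field.append('"')   # "" inside quotes -> literal quote
                        i += 1
                    else:
                        in_quotes = False   # closing quote
                else:
                    field.append(c)         # delimiters/newlines kept verbatim
            else:
                if c == '"':
                    in_quotes = True
                elif c == delimiter:
                    row.append("".join(field)); field = []
                elif c in "\r\n":
                    if c == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                        i += 1              # treat CRLF as one terminator
                    row.append("".join(field)); rows.append(row)
                    field, row = [], []
                else:
                    field.append(c)
            i += 1
        if field or row:                    # last row without trailing newline
            row.append("".join(field)); rows.append(row)
        return rows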
We've got at least a few dozen customer integrations that parse CSV-ish files, and they all have a custom parser. Many of these have been chugging along for over a decade, sending "mission critical" data back and forth.
It's dead simple to whip up, and we can easily tweak it to whatever the customers software spits out, like one field suddenly being UTF-8 encoded in an otherwise Windows-1252 file.
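That particular tweak can be as small as a per-field decode with a fallback, something along these lines:

    def decode_field(raw: bytes) -> str:
        # Try UTF-8 first; anything that isn't valid UTF-8 is treated as Windows-1252.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("windows-1252")

    decode_field(b"Sm\xc3\xb8rrebr\xc3\xb8d")  # valid UTF-8 -> "Smørrebrød"
    decode_field(b"Sm\xf8rrebr\xf8d")          # falls back to Windows-1252 -> "Smørrebrød"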
100%. Don’t fix it if it ain’t broken. This is the pragmatic approach that often comes across to new devs as “unsexy”. I’ve got some news about “unsexy” software - it works, and it brings in the revenue.