As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop. I definitely can't do that with .xlsx. There is almost zero friction to bolting CSV export capability to an existing system, which is part of why it's so popular.
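For the trivial case, that really is the whole exporter. A minimal sketch in C, assuming the values are plain numbers that never need quoting:

    #include <stdio.h>

    /* Hypothetical sketch of the "couple of fprintf statements and a loop"
       case: a fixed header plus numeric data, nothing that ever needs quoting. */
    int main(void)
    {
        double readings[] = { 1.5, 2.25, 3.0 };
        FILE *f = fopen("out.csv", "w");
        if (!f) return 1;
        fprintf(f, "sample,value\n");
        for (int i = 0; i < 3; i++)
            fprintf(f, "%d,%.2f\n", i, readings[i]);
        fclose(f);
        return 0;
    }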
You can write what "looks" like CSV to you, but there are no guarantees it will import correctly.
The problem is 10x worse when you get CSV from one source and rely on another process to load it. I fought this problem for several days going from NetSuite to Snowflake via CSV.
Can you give an example? The rules for CSV files are so simple I'm struggling to imagine a case where something looks correct but in fact isn't correct.
That sounds like the problem of badly formatted CSV, not a problem with CSV per se.
If you stick to one delimiter, and that delimiter is a comma, and escape the delimiter in the data with double-quotes around the entry, and escape double quotes with two double-quotes, well, you have written CSV that is correct and looks correct and will be parsed correctly by literally every CSV parser.
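To make that concrete, here is a rough sketch of those rules in C (a hypothetical helper, not taken from any particular library):

    #include <stdio.h>
    #include <string.h>

    /* Quote a field only if it contains a comma, a double quote, or a newline,
       and escape embedded double quotes by doubling them. */
    static void csv_field(FILE *out, const char *s)
    {
        if (strpbrk(s, ",\"\r\n") == NULL) {
            fputs(s, out);
            return;
        }
        fputc('"', out);
        for (; *s; s++) {
            if (*s == '"')
                fputc('"', out);   /* "" escapes a literal " */
            fputc(*s, out);
        }
        fputc('"', out);
    }

    int main(void)
    {
        csv_field(stdout, "plain");      fputc(',', stdout);
        csv_field(stdout, "a,b");        fputc(',', stdout);
        csv_field(stdout, "say \"hi\""); fputc('\n', stdout);
        /* Output: plain,"a,b","say ""hi""" */
        return 0;
    }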
That's really not a serious argument against CSV. Since you paraphrase in a silly way, I can do it too! Your "argument" is "Badly formatted files exist, therefore CSV bad".
Everyone "against CSV" seems to be arguing against badly formatted CSV, and leaping to the conclusion that "CSV is just bad" without much more to say about it. I'm sorry that badly formatted CSV gave you a bad time, but the format is fine and gets its job done.
"It doesn't have x, y or z feature therefore no one should be using it ever" is kind of a dumb argument, honestly.
> Your "argument" is "Badly formatted files exist, therefore CSV bad".
The argument is actually that the badly formatted CSV files have taken over, therefore CSV is bad. You can't reject them, so your import becomes unreliable.
Me, a naive idiot: CSV is simple I will write my own exporter because I am clever
Me, 20 minutes later: Heh that was easy I am a genius
Me, 21 minutes later: Unicode is ruining my life T_T
Don't get me wrong, I really like CSV because it's so primitive and works so well if you are disciplined about it. But it's easy to get something working on a small dataset and forget all the other possibilities only to faceplant as soon as you step outside your front door. In the case above my experience with dealing with CSV data from other people made me arrogant, when I should have just taken a few minutes to learn my way around a mature library.
In UTF-8, the bytes for a comma and a quote only ever appear as those characters. They never occur as part of a multibyte sequence, by design.
If you have Unicode problems, then you have Unicode problems, but they wouldn't seem to be CSV problems...? Unless you're being incredibly sloppy in your programming and outputting double-byte UTF-16 strings surrounded by single-byte commas and quotes or something...?
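To illustrate (a toy sketch; every byte of a multibyte UTF-8 sequence has the high bit set, so a raw byte scan for commas can never land mid-character):

    #include <stdio.h>

    /* Count fields in a line by scanning raw bytes. Safe on UTF-8 input
       because ',' (0x2C) never appears inside a multibyte sequence. */
    static int count_fields(const char *line)
    {
        int n = 1;
        for (; *line; line++)
            if (*line == ',')   /* never fires inside a multibyte character */
                n++;
        return n;
    }

    int main(void)
    {
        printf("%d\n", count_fields("名前,値,😀"));  /* prints 3 */
        return 0;
    }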
If you're manually generating your own CSV files, you probably know what kind of data you are generating and consequently whether your data is going to contain commas. If commas and newlines don't exist in your data, then you can safely ignore quoting rules when generating CSV files. I know that I've generated CSVs in the past and rather than figuring out the correct way to quote the strings, I just removed any inconvenient characters without any loss to the data at all. Obviously this is not "correct" but you don't have to implement cases if you know they won't show up.
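Something along these lines, say (a hypothetical helper, lossy by design):

    #include <stdio.h>

    /* Instead of quoting, drop any character that would break the row.
       Obviously not general-purpose, but fine when those characters carry
       no meaning in your data anyway. */
    static void put_sanitized(FILE *out, const char *s)
    {
        for (; *s; s++)
            if (*s != ',' && *s != '"' && *s != '\n' && *s != '\r')
                fputc(*s, out);
    }

    int main(void)
    {
        put_sanitized(stdout, "host-1,\n");   /* writes "host-1" */
        fputc('\n', stdout);
        return 0;
    }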
This is true, but a lot of data processing takes place in a context where frictionless export functionality is more important than a 100% guarantee of import compatibility. I'd rather ingest city = ",CHANGSHA,HUNAN" (real example!) than ingest nothing at all because my vendor doesn't have time to integrate a JSON serializer.
> but there are no guarantees it will import correctly.
What do you mean, there are "no guarantees"? You are in charge! You know what data you're dumping, you can see if it imports well. You can tailor your use case.
That's not the same as getting a CSV from some dump, where you have limited (if any) control over the behavior.
> As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop.
And usually generate garbage for anything but the most trivial case, which really nobody gives a shit about. That's the main reason why CSV absolutely sucks too: you have to waste months diagnosing the broken shit you're given and implementing the workarounds necessary to deal with it.
> I definitely can't do that with .xlsx.
You probably can though. An xlsx file is just a bunch of XML files in a zip.
Define "garbage." If I know what my data looks like, I can anticipate the edge cases ahead of time. Plenty of CSV exports work this way, they don't need to be general if the schema is already imposed by the system.
Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
I know this is HN and we like to pretend we're all data scientists working on clusters with eleventy billion gigs of RAM, but us embedded systems folks exist too.
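For a sense of scale, here is roughly what that looks like (a sketch that assumes a mounted FatFS volume built with FF_USE_STRFUNC so f_printf is available, and fields that can never contain commas or newlines):

    #include "ff.h"   /* FatFS */

    /* Append one sample per line as CSV. One open/append/close per sample
       keeps RAM usage tiny; no quoting logic is needed because the fields
       are all plain numbers. */
    void log_sample(unsigned long elapsed_ms, long millivolts, long milliamps)
    {
        FIL fp;
        if (f_open(&fp, "log.csv", FA_OPEN_APPEND | FA_WRITE) == FR_OK) {
            f_printf(&fp, "%lu,%ld,%ld\n", elapsed_ms, millivolts, milliamps);
            f_close(&fp);
        }
    }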
Incorrect encoding, incorrect separators (record and field both), incorrect escaping / quoting, etc…
> If I know what my data looks like
If you control the entirety of the pipeline, the format you're using is basically irrelevant. You can pick whatever you want and call it however you want.
> Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
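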
You can pretty literally do that with XML and zip files: write the uncompressed data, keep track of the amount of data (for the bits which are not fixed-size), write the file header, done. You just need to keep track of your file sizes and offsets in order to write the central directory. And the reality's if you're replacing a CSV file the only dynamic part will be the one worksheet, everything else will be constant.
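For the curious, here is a rough sketch of writing a single stored (uncompressed) zip entry that way in plain C. It is not a valid .xlsx on its own (that also needs [Content_Types].xml, workbook.xml and friends, which are mostly constant boilerplate), but it shows how little bookkeeping the container itself takes:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit-by-bit CRC-32 (poly 0xEDB88320), as the zip format requires. */
    static uint32_t crc32_buf(const unsigned char *p, size_t n)
    {
        uint32_t c = 0xFFFFFFFFu;
        while (n--) {
            c ^= *p++;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        return c ^ 0xFFFFFFFFu;
    }

    static void put16(FILE *f, unsigned v) { fputc(v & 0xFF, f); fputc((v >> 8) & 0xFF, f); }
    static void put32(FILE *f, uint32_t v) { put16(f, v & 0xFFFF); put16(f, v >> 16); }

    int main(void)
    {
        /* One stored entry with a placeholder payload. */
        const char *name = "xl/worksheets/sheet1.xml";
        const char *data = "<worksheet/>";
        uint32_t len  = (uint32_t)strlen(data);
        uint32_t crc  = crc32_buf((const unsigned char *)data, len);
        unsigned nlen = (unsigned)strlen(name);

        FILE *f = fopen("out.zip", "wb");
        if (!f) return 1;

        /* Local file header at offset 0, then the uncompressed data. */
        put32(f, 0x04034b50); put16(f, 20); put16(f, 0); put16(f, 0); /* stored */
        put16(f, 0); put16(f, 0);                  /* DOS time/date: don't care */
        put32(f, crc); put32(f, len); put32(f, len);
        put16(f, nlen); put16(f, 0);
        fwrite(name, 1, nlen, f);
        fwrite(data, 1, len, f);

        /* Central directory: same metadata plus the local header offset. */
        long cd_off = ftell(f);
        put32(f, 0x02014b50); put16(f, 20); put16(f, 20); put16(f, 0); put16(f, 0);
        put16(f, 0); put16(f, 0);
        put32(f, crc); put32(f, len); put32(f, len);
        put16(f, nlen); put16(f, 0); put16(f, 0);
        put16(f, 0); put16(f, 0); put32(f, 0);
        put32(f, 0);                                /* offset of local header */
        fwrite(name, 1, nlen, f);
        long cd_size = ftell(f) - cd_off;

        /* End-of-central-directory record. */
        put32(f, 0x06054b50); put16(f, 0); put16(f, 0); put16(f, 1); put16(f, 1);
        put32(f, (uint32_t)cd_size); put32(f, (uint32_t)cd_off); put16(f, 0);

        fclose(f);
        return 0;
    }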
> If you control the entirety of the pipeline, the format you're using is basically irrelevant.
I think you are missing the point -- you only need to know about the generator to know about the format.
Since the parent poster was talking embedded, here is one example: a data logger on a tiny embedded device records tuples of (elapsed-time, voltage, current). You need this to be readable in the widest variety of programs possible. What format do you use?
I think the answer is pretty clear: CSV. It is compatible with every programming language and spreadsheet out there, and in a pinch, you can even open it in text editor and manually examine the data.
Using something like XLSX here would be total craziness: it will make code significantly bigger, and it will severely decrease compatibility.
I have a script which generates a CSV file using a bunch of print statements. The columns are hostnames and some numbers, so it is never going to contain commas or newlines. This will be perfectly valid CSV every time.
That’s why CSV is absolutely beautiful - there is a huge number of applications that people really care about whose data is constrained enough that there is no need to worry about CSV escaping or to pull in any third-party libraries.
Creating an XLSX file by hand is possible, but it would be a large amount of code and I wouldn’t include it in my script; it would need to be a separate library - which means build system support, learning the API, etc...
Even if it were true, it wouldn't matter a whit to the production side of the format, which is what "produce CSVs using fprintf" is: Excel can consume them all.
As soon as you open it in Excel, it's garbage anyway, since it will replace date-like items with nonsense, drop digits from numbers, convert anything it can, reencode monetary units, and so on.
If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.
> If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.
No, you can not. Because the CSV format is so fuzzy you can very easily parse incorrectly and end up with a valid parse full of garbage.
Trivially: incorrect separator, file happens to not contain that separator at all, you end up with a single column. That's a completely valid file, and might even make sense for the system. Also trivially: incorrect encoding, anything ascii-compatible will parse fine as iso-8859-*. Also trivially: incorrect quoting / escaping, might not break the parse, will likely corrupt the data (because you will not be stripping the quotes or applying the escapes and will store them instead).
It's like you people have never had to write ingestion pipelines for CSVs coming from randos.
>It's like you people have never had to write ingestion pipelines for CSVs coming from randos.
That's because this is not what this thread is about.
The comment you're responding to is not about CSVs coming from "randos". It's for the case where that rando is you, so you can make sure the problems you mention don't happen on the generation side of CSVs.
Because there is no way to perform that enforcement, because you say “CSV” and people understand “my old garbage” and you can’t fix that; and because millions of incorrect documents will yield a valid but nonsensical parse.
People don't care which format it is, as long as it simply works in any spreadsheet software.
Yeah, I can generate any file with a bunch of printf calls, but with CSV I don't have to read a specification, and it's possible to read it back with a bunch of reads without having to use an XML or XLSX library.
> That's the main reason why CSV absolutely sucks too [...]
Is it? I think you're absolutely right that naive points of view like the one you're responding to will lead to avoidable bugs, but I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.
> I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.
The simplicity of CSV is what tells people that they can parse and generate it without a library, and even more so that that's a feature of CSV. You only have to read the comments disagreeing with me to see exactly that.