As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop. I definitely can't do that with .xlsx. There is almost zero friction to bolting CSV export capability to an existing system, which is part of why it's so popular.
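For the trivial case, that really is the whole exporter. A minimal sketch in C, assuming the values are plain numbers that never need quoting:

    #include <stdio.h>

    /* Hypothetical sketch of the "couple of fprintf statements and a loop"
       case: a fixed header plus numeric data, nothing that ever needs quoting. */
    int main(void)
    {
        double readings[] = { 1.5, 2.25, 3.0 };
        FILE *f = fopen("out.csv", "w");
        if (!f) return 1;
        fprintf(f, "sample,value\n");
        for (int i = 0; i < 3; i++)
            fprintf(f, "%d,%.2f\n", i, readings[i]);
        fclose(f);
        return 0;
    }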
You can write what "looks" like CSV to you, but there are no guarantees it will import correctly.
The problem is 10x worse when you get CSV from one source and rely on another process to load it. I fought this problem for several days going from NetSuite to Snowflake via CSV.
Can you give an example? The rules for CSV files are so simple I'm struggling to imagine a case where something looks correct but in fact isn't correct.
That sounds like the problem of badly formatted CSV, not a problem with CSV per se.
If you stick to one delimiter, and that delimiter is a comma, and escape the delimiter in the data with double-quotes around the entry, and escape double quotes with two double-quotes, well, you have written CSV that is correct and looks correct and will be parsed correctly by literally every CSV parser.
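To make that concrete, here is a rough sketch of those rules in C (a hypothetical helper, not taken from any particular library):

    #include <stdio.h>
    #include <string.h>

    /* Quote a field only if it contains a comma, a double quote, or a newline,
       and escape embedded double quotes by doubling them. */
    static void csv_field(FILE *out, const char *s)
    {
        if (strpbrk(s, ",\"\r\n") == NULL) {
            fputs(s, out);
            return;
        }
        fputc('"', out);
        for (; *s; s++) {
            if (*s == '"')
                fputc('"', out);   /* "" escapes a literal " */
            fputc(*s, out);
        }
        fputc('"', out);
    }

    int main(void)
    {
        csv_field(stdout, "plain");      fputc(',', stdout);
        csv_field(stdout, "a,b");        fputc(',', stdout);
        csv_field(stdout, "say \"hi\""); fputc('\n', stdout);
        /* Output: plain,"a,b","say ""hi""" */
        return 0;
    }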
That's really not a serious argument against CSV. Since you paraphrase in a silly way, I can do it too! Your "argument" is "Badly formatted files exist, therefore CSV bad".
Everyone "against CSV" seems to be arguing against badly formatted CSV, and leaping to the conclusion that "CSV is just bad" without much more to say about it. I'm sorry that badly formatted CSV gave you a bad time, but the format is fine and gets its job done.
"It doesn't have x, y or z feature therefore no one should be using it ever" is kind of a dumb argument, honestly.
> Your "argument" is "Badly formatted files exist, therefore CSV bad".
The argument is actually that the badly formatted CSV files have taken over, therefore CSV is bad. You can't reject them, so your import becomes unreliable.
Me, a naive idiot: CSV is simple I will write my own exporter because I am clever
Me, 20 minutes later: Heh that was easy I am a genius
Me, 21 minutes later: Unicode is ruining my life T_T
Don't get me wrong, I really like CSV because it's so primitive and works so well if you are disciplined about it. But it's easy to get something working on a small dataset and forget all the other possibilities only to faceplant as soon as you step outside your front door. In the case above my experience with dealing with CSV data from other people made me arrogant, when I should have just taken a few minutes to learn my way around a mature library.
In UTF-8, the bytes for a comma and a quote only ever appear as those characters. They never occur as part of a multibyte sequence, by design.
If you have Unicode problems, then you have Unicode problems, but they wouldn't seem to be CSV problems...? Unless you're being incredibly sloppy in your programming and outputting double-byte UTF-16 strings surrounded by single-byte commas and quotes or something...?
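To illustrate (a toy sketch; every byte of a multibyte UTF-8 sequence has the high bit set, so a raw byte scan for commas can never land mid-character):

    #include <stdio.h>

    /* Count fields in a line by scanning raw bytes. Safe on UTF-8 input
       because ',' (0x2C) never appears inside a multibyte sequence. */
    static int count_fields(const char *line)
    {
        int n = 1;
        for (; *line; line++)
            if (*line == ',')   /* never fires inside a multibyte character */
                n++;
        return n;
    }

    int main(void)
    {
        printf("%d\n", count_fields("名前,値,😀"));  /* prints 3 */
        return 0;
    }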
If you're manually generating your own CSV files, you probably know what kind of data you are generating and consequently whether your data is going to contain commas. If commas and newlines don't exist in your data, then you can safely ignore quoting rules when generating CSV files. I know that I've generated CSVs in the past and rather than figuring out the correct way to quote the strings, I just removed any inconvenient characters without any loss to the data at all. Obviously this is not "correct" but you don't have to implement cases if you know they won't show up.
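Something along these lines, say (a hypothetical helper, lossy by design):

    #include <stdio.h>

    /* Instead of quoting, drop any character that would break the row.
       Obviously not general-purpose, but fine when those characters carry
       no meaning in your data anyway. */
    static void put_sanitized(FILE *out, const char *s)
    {
        for (; *s; s++)
            if (*s != ',' && *s != '"' && *s != '\n' && *s != '\r')
                fputc(*s, out);
    }

    int main(void)
    {
        put_sanitized(stdout, "host-1,\n");   /* writes "host-1" */
        fputc('\n', stdout);
        return 0;
    }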
This is true, but a lot of data processing takes place in a context where frictionless export functionality is more important than a 100% guarantee of import compatibility. I'd rather ingest city = ",CHANGSHA,HUNAN" (real example!) than ingest nothing at all because my vendor doesn't have time to integrate a JSON serializer.
> but there are no guarantees it will import correctly.
What do you mean, there are "no guarantees"? You are in charge! You know what data you're dumping, you can see if it imports well. You can tailor your use case.
That's not the same as getting a CSV from some dump, where you have limited (if any) control over the behavior.
> As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop.
And usually generate garbage for anything but the most trivial case, which really nobody gives a shit about. That's the main reason why CSV absolutely sucks too: you have to waste months diagnosing the broken shit you're given and implementing the workarounds necessary to deal with it.
> I definitely can't do that with .xlsx.
You probably can though. An xlsx file is just a bunch of XML files in a zip.
Define "garbage." If I know what my data looks like, I can anticipate the edge cases ahead of time. Plenty of CSV exports work this way, they don't need to be general if the schema is already imposed by the system.
Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
I know this is HN and we like to pretend we're all data scientists working on clusters with eleventy billion gigs of RAM, but us embedded systems folks exist too.
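For a sense of scale, here is roughly what that looks like (a sketch that assumes a mounted FatFS volume built with FF_USE_STRFUNC so f_printf is available, and fields that can never contain commas or newlines):

    #include "ff.h"   /* FatFS */

    /* Append one sample per line as CSV. One open/append/close per sample
       keeps RAM usage tiny; no quoting logic is needed because the fields
       are all plain numbers. */
    void log_sample(unsigned long elapsed_ms, long millivolts, long milliamps)
    {
        FIL fp;
        if (f_open(&fp, "log.csv", FA_OPEN_APPEND | FA_WRITE) == FR_OK) {
            f_printf(&fp, "%lu,%ld,%ld\n", elapsed_ms, millivolts, milliamps);
            f_close(&fp);
        }
    }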
Incorrect encoding, incorrect separators (record and field both), incorrect escaping / quoting, etc…
> If I know what my data looks like
If you control the entirety of the pipeline, the format you're using is basically irrelevant. You can pick whatever you want and call it however you want.
> Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
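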
You can pretty literally do that with XML and zip files: write the uncompressed data, keep track of the amount of data (for the bits which are not fixed-size), write the file header, done. You just need to keep track of your file sizes and offsets in order to write the central directory. And the reality's if you're replacing a CSV file the only dynamic part will be the one worksheet, everything else will be constant.
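For the curious, here is a rough sketch of writing a single stored (uncompressed) zip entry that way in plain C. It is not a valid .xlsx on its own (that also needs [Content_Types].xml, workbook.xml and friends, which are mostly constant boilerplate), but it shows how little bookkeeping the container itself takes:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit-by-bit CRC-32 (poly 0xEDB88320), as the zip format requires. */
    static uint32_t crc32_buf(const unsigned char *p, size_t n)
    {
        uint32_t c = 0xFFFFFFFFu;
        while (n--) {
            c ^= *p++;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        return c ^ 0xFFFFFFFFu;
    }

    static void put16(FILE *f, unsigned v) { fputc(v & 0xFF, f); fputc((v >> 8) & 0xFF, f); }
    static void put32(FILE *f, uint32_t v) { put16(f, v & 0xFFFF); put16(f, v >> 16); }

    int main(void)
    {
        /* One stored entry with a placeholder payload. */
        const char *name = "xl/worksheets/sheet1.xml";
        const char *data = "<worksheet/>";
        uint32_t len  = (uint32_t)strlen(data);
        uint32_t crc  = crc32_buf((const unsigned char *)data, len);
        unsigned nlen = (unsigned)strlen(name);

        FILE *f = fopen("out.zip", "wb");
        if (!f) return 1;

        /* Local file header at offset 0, then the uncompressed data. */
        put32(f, 0x04034b50); put16(f, 20); put16(f, 0); put16(f, 0); /* stored */
        put16(f, 0); put16(f, 0);                  /* DOS time/date: don't care */
        put32(f, crc); put32(f, len); put32(f, len);
        put16(f, nlen); put16(f, 0);
        fwrite(name, 1, nlen, f);
        fwrite(data, 1, len, f);

        /* Central directory: same metadata plus the local header offset. */
        long cd_off = ftell(f);
        put32(f, 0x02014b50); put16(f, 20); put16(f, 20); put16(f, 0); put16(f, 0);
        put16(f, 0); put16(f, 0);
        put32(f, crc); put32(f, len); put32(f, len);
        put16(f, nlen); put16(f, 0); put16(f, 0);
        put16(f, 0); put16(f, 0); put32(f, 0);
        put32(f, 0);                                /* offset of local header */
        fwrite(name, 1, nlen, f);
        long cd_size = ftell(f) - cd_off;

        /* End-of-central-directory record. */
        put32(f, 0x06054b50); put16(f, 0); put16(f, 0); put16(f, 1); put16(f, 1);
        put32(f, (uint32_t)cd_size); put32(f, (uint32_t)cd_off); put16(f, 0);

        fclose(f);
        return 0;
    }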
> If you control the entirety of the pipeline, the format you're using is basically irrelevant.
I think you are missing the point -- you only need to know about the generator to know about the format.
Since the parent poster was talking embedded, here is one example: a data logger on a tiny embedded device records tuples of (elapsed-time, voltage, current). You need this to be readable in the widest variety of programs possible. What format do you use?
I think the answer is pretty clear: CSV. It is compatible with every programming language and spreadsheet out there, and in a pinch, you can even open it in text editor and manually examine the data.
Using something like XLSX here would be total craziness: it will make code significantly bigger, and it will severely decrease compatibility.
I have a script which generates a CSV file using a bunch of print statements. The columns are hostnames and some numbers, so it is never going to contain commas or newlines. This will be perfectly valid CSV every time.
That’s why CSV is absolutely beautiful - there is a huge number of applications that people really care about whose data is constrained enough that there is no need to worry about CSV escaping or to pull in any third-party libraries.
Creating an XLSX file by hand is possible, but it would be a large amount of code and I wouldn’t include it in my script; it would need to be a separate library - which means build system support, learning the API, etc...
Even if it were true, it wouldn't matter a whit to the production side of the format, which is what "produce CSVs using fprintf" is: Excel can consume them all.
As soon as you open it in Excel, it's garbage anyway, since it will replace date-like items with nonsense, drop digits from numbers, convert anything it can, reencode monetary units, and so on.
If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.
> If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.
No, you can not. Because the CSV format is so fuzzy you can very easily parse incorrectly and end up with a valid parse full of garbage.
Trivially: incorrect separator, file happens to not contain that separator at all, you end up with a single column. That's a completely valid file, and might even make sense for the system. Also trivially: incorrect encoding, anything ascii-compatible will parse fine as iso-8859-*. Also trivially: incorrect quoting / escaping, might not break the parse, will likely corrupt the data (because you will not be stripping the quotes or applying the escapes and will store them instead).
It's like you people have never had to write ingestion pipelines for CSVs coming from randos.
>It's like you people have never had to write ingestion pipelines for CSVs coming from randos.
That's because this is not what this thread is about.
The comment you're responding to is not about CSVs coming from "randos". It's for the case where that rando is you, so you can make sure the problems you mention don't happen on the generation side of CSVs.
Because there is no way to perform that enforcement, because you say “CSV” and people understand “my old garbage” and you can’t fix that; and because millions of incorrect documents will yield a valid but nonsensical parse.
People don't care which format it is, as long as it simply works in any spreadsheet software.
Yeah, I can generate any file with a bunch of printf calls, but with CSV I don't have to read a specification, and it's possible to read it back with a bunch of reads without having to use an XML or XLSX library.
> That's the main reason why CSV absolutely sucks too [...]
Is it? I think you're absolutely right that naive points of view like the one you're responding to will lead to avoidable bugs, but I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.
> I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.
The simplicity of CSV is what tells people that they can parse and generate it without a library, and even more so that that's a feature of CSV. You only have to read the comments disagreeing with me to see exactly that.