I'm a bit surprised by the need for the verbose "--icsv --ocsv"... Shouldn't it be trivial to see that the input is csv? (And in that case, the output could be csv by default.)
> Shouldn't it be trivial to see that the input is csv?
There is no reliable way to infer that a file is CSV: CSV files have no magic number, all kinds of separators are in use, and quoting rules differ widely.
It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.
I much prefer my tools not to be "90%-smart", but predictable.
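For what it's worth, even Python's csv.Sniffer, which exists precisely to guess dialects, illustrates the problem: on a short, ambiguous sample it either gives up or silently picks one interpretation (a rough sketch, not a claim about any particular tool):

    import csv

    # Both commas and tabs appear once per line -- which one is the delimiter?
    ambiguous = "a\tb,c\nd\te,f\n"

    try:
        dialect = csv.Sniffer().sniff(ambiguous)
        print("guessed delimiter:", repr(dialect.delimiter))
    except csv.Error as err:
        print("sniffer gave up:", err)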
I assumed GP was referring to the filename extension. Of course people put all sorts of nonsense into files and then call them ".csv", but it's not a bad heuristic to use that to guess the format as a default, and even guess that the output format should be the same, in the absence of other flags.
> I much prefer my tools not to be "90%-smart", but predictable.
I understand your preference, but please recognize that it is not universal. Some of us much prefer a super-simple tool that fails for some particular cases, while requiring special options to be completely general. Thus you can use the heuristic defaults interactively (where you'll notice the errors easily), and write scripts with the more explicit form.
The problem with this is the same problem that CSV has to solve, though: there's no escape character specified in ASCII, so you can't have a unit that contains any of those 4 characters without breaking the parser.
In turn I get what you are saying, but in this case it is not a trivial problem. CSV files seem simple on the surface but there are all sorts of gotchas.
For example, there's plenty of variation between platforms/applications when it comes to just terminating a line. Are we using CR, LF, CR+LF, LF+CR, NL, RS, EOL? What do we do when the source file is produced by an app that uses one approach but doesn't care about the others (allows their occurrence)?
If those others appear in the data, would our "90%-smart" tool make the wrong determination about line termination for the whole file? Would everything just break, or would the tool churn along and wreck all the data? How long until you noticed?
By my estimation, the "90%-smart" tool would be about 30% dependable unless used only with a known source and format, meaning it wouldn't need to be smart in the first place.
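As a rough illustration of how much the terminator assumption matters (hypothetical data, Python used only to make the splits visible):

    # A bare CR inside a field, CRLF as the "real" terminator.
    raw = "name,note\r\nalice,line1\rline2\r\nbob,ok\r\n"

    print(raw.split("\r\n"))  # 4 chunks: header, alice, bob, trailing ''
    print(raw.split("\n"))    # 4 chunks, but with stray \r characters left behind
    print(raw.split("\r"))    # 5 chunks: the embedded CR now splits alice's note in two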
> CSV files seem simple on the surface but there are all sorts of gotchas.
My point is that supporting "general CSV files" is useless. Restricting your tooling to "simple CSV files" is good. But my opinions are not very representative. I also think that it is perfectly acceptable for a shell script to fail badly when it encounters filenames with spaces.
> It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.
How can that be? Is it possible to have such an ambiguous file? I mean, if a file contains a single number on a single line it can be anything, but the interpretation is the same. Can you create a file that has different contents depending on whether it is interpreted as csv or tsv?
>Can you create a file that has different contents depending on whether it is interpreted as csv or tsv?
Easily!
a\tb,c
Is "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!
In practice, if the first line contains only a tab or a comma it might be enough to infer that as the separator, but:
1. that would fail on single-column files (by misinterpreting them as multi-column if the unused separator appears)
2. that couldn't infer anything on files where both separators appear
So it would only be a 90% (or maybe 99%) solution.
If it's a single-column file, you - the user - should know it and act accordingly. Yes, scripting usage on random input, I know, but in that case you would specify the input type. But heuristically defaulting to the most common format when in doubt (that would be CSV in the CSV/TSV case) for interactive use is the way to go. Or at least, it's what I would personally expect as a user.
Heuristics are annoying and data-dependent algo changes are dangerous.
A completely different example: limits in Splunk aggregations. You can run your report on small data, but when you scale it up (to real production data sizes, maybe) you suddenly get wrong numbers, and maybe results like "0 errors of type X" when the real answer is that there are >=1 errors, because one of the aggregations has a window size limit that it silently applies. This stuff is dangerous.
What Splunk was doing for me would be the equivalent of an SQL join giving approximate answers when the data is too big.
The issue with the heuristic is that it can fail depending on the input, and the input can easily change in a way that kills the heuristic.
Say you run the tool on a file, it detects csv input, and all is well. Then you update the file, and now it includes a tab character in the first line; the heuristic detects it as a tsv and fails - or gives up, or whatever.
Sure, you can "improve the heuristic", but the data can still, always, change in a way that defeats it. You now need to either be careful with the data or _know_ that you should specify the format, without the tool telling you. Everything seems to work immediately, and then later it blows up. That's a problem akin to e.g. bash and filenames with spaces. Everything works, until someone has a space in a filename, and then you get told that you should have known to quote everything all along (the solution there would be to abolish word splitting).
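To make that failure mode concrete, here is a toy first-line heuristic (purely hypothetical, not how any real tool works) that flips its answer the moment a tab sneaks into the header:

    def guess_sep(first_line: str) -> str:
        # Toy heuristic: whichever separator appears more often on line 1 wins.
        return "\t" if first_line.count("\t") > first_line.count(",") else ","

    print(guess_sep("name,city"))             # ','  -> treated as CSV, works fine
    print(guess_sep("name\tor\talias,city"))  # '\t' -> one edit later, treated as TSV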
To coin a pithy phrase: When a tool is easy to misuse and a user misuses it, blame the tool, not the user.
Now, if I were writing this thing I would make the logic much simpler: Make it default to csv (or whichever format is more common). Now the way to break the "heuristic" is to give data in the wrong format. But if you use csv, you don't have to explicitly give the format and your data can't break the heuristic (unless it switches format, which you would know about).
Not only is it possible, as a couple of commenters have already shown, but, due to the many variants of CSV out there, it's possible to construct a CSV that has different contents depending on which dialect you tell your CSV reader to expect. I'll leave the actual construction as an exercise to the reader, but, it would work along the same lines as the ambiguous TSV/CSV files you've seen here already.
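A minimal Python sketch of the idea (real dialect differences, e.g. doubled quotes vs backslash escapes, give you the same effect):

    import csv, io

    row = 'a,"b,c"\n'

    print(next(csv.reader(io.StringIO(row))))                          # ['a', 'b,c']
    print(next(csv.reader(io.StringIO(row), quoting=csv.QUOTE_NONE)))  # ['a', '"b', 'c"']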
If a text file has the same number of commas, tabs or semicolons on every line it most probably is (but, obviously, is not guaranteed to be) a CSV/TSV/SSV.
Defining every flavour of these is hardly possible with a simple command line, so I would rather let the user specify an entire configuration file for this. We probably need an entire CSV schema language.
That's already not true for quoted entries that contain the separator, which I think is a common CSV use case.
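For example (a sketch, assuming Python's csv module), one ordinary quoted field is enough to throw off a per-line separator count even though the file is perfectly valid CSV:

    import csv, io

    text = 'name,comment\nalice,"fast, cheap, good"\n'

    print([line.count(",") for line in text.splitlines()])      # [1, 3] -- raw counts disagree
    print([len(row) for row in csv.reader(io.StringIO(text))])  # [2, 2] -- both rows have 2 fields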
But I do agree you could have a heuristic. E.g. ends in .csv and contains a lot more commas/semicolons/tabs than you would expect in normal text in the first 1-5 lines.
You could still have the flag as a fallback when you need something that's completely reliable.
I like how `csvkit` does it; assumes CSV input by default, `-t` for TSV input, `-T` for TSV output. Given I run `csvformat -T` many times a week, I appreciate the brevity.