I do the same. I wrote[0] a little wrapper around this workflow that adds a few niceties: 1) it normalizes column names and strips whitespace, and 2) it supports importing multiple CSV files without importing each header line as a record.
If you have Windows, you have a tool that can do this already installed: PowerShell. (If you don't have Windows, PowerShell is free and open source [MIT license]. One of the first things I install on a new Linux or macOS box lately.)
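For example, a typical CSV query looks something like this (the file and column names here are made up for illustration):

    Import-Csv .\example.csv |
        Where-Object { [int]$_.quantity -gt 10 } |
        Sort-Object color, shape |
        Export-Csv .\filtered.csv -NoTypeInformation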
I make a lot of use of Miller at work and it has been a godsend. People tend to underestimate tools like this, but I'm able to answer a lot of questions about our data that would take other people much longer, or that they don't have the capacity to check at all. It's great as a first step, and if anything more complex is needed we usually spin up a Python script. So it's very much in the philosophy of small composable tools, but for records.
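To give a flavour of what that looks like (the file and field names here are invented):

    mlr --icsv --ojson filter '$status == "error"' then count-distinct -f region example.csv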
Not exactly the same, but we wrote a library that easily loads any kind of delimited file and finds the header (even if it isn't the first row). It also loads JSON, Parquet, and Avro into a dataframe. Not exactly a CLI, but pretty easy:
Anyway, Miller is a pretty interesting CLI. I'm not sure how the header detection works, especially if the header isn't the first row (which is often the case).
The comparison between their ongoing C version [1] and the upcoming Go version [2] is nice to see. The README talks a bit more about the performance comparison and the C version's strengths, but still.
Thanks for tab. I have used it a bit in the pipelines of some of my projects. Like jq, tab is another good example of a DSL that lets you get things done tersely and quickly. Same as Perl, I suppose, but cleaner and more functional. Raku is also great, but it adds big dependencies, whereas tab is a single binary, which is always nice.
If you are taking feature requests, string interpolation or a sprintf function would make report-generation kinds of things easier for me.
A single-binary dependency may be a better fit for one project, as you outline, but I would make the wider case for a common core language that provides a DSL toolkit with a recursive-descent-style Grammar capability. This helps build critical mass across multiple DSLs to maintain and improve a secure code base. In turn, DSL authors get a more productive development environment. Raku embeds Grammar support for DSLs and will soon have AST introspection for "Slang" authors to reach in and build elements directly...
Just yesterday I was opining in IRC about what it would be like to have an operating system built from the ground up to support smarter data formats than just text.
The concept of everything as a file has awesome benefits, I believe, but that doesn't mean files need to be unstructured text.
There's a core problem: with text languages (like AWK) or structured-regex tools (sed, etc.), you end up having to both parse AND manipulate the data, which is no fun and prone to errors.
Abstracting away all of that into codecs that all of coreutils could speak would be very cool to me.
The second issue is structured data vs. unstructured text. CSVs and other table-based formats make sense sometimes, and sometimes you want to be able to query things more easily; JSON provides that. What the synthesis of both would be, I don't know... but maybe this is closer to heaven.
I'd like to _minimize_ the amount of parsing or text manipulation done in any program and be able to focus on extracting and inserting the data I need, like a Rosetta Stone that just handles everything for me while I work. I want to be able to do:
awk "{print $schoolName}" or awk "{print $4}" and have it just work - JSON, text, CSV.
Maybe the most notable system that had "files as structured data" is the IBM S/360 lineage of mainframes.
> In the context of IBM mainframe computers in the S/360 line, a data set (IBM preferred) or dataset is a computer file having a record organization.
> Data sets are not unstructured streams of bytes, but rather are organized in various logical record[3] and block structures determined by the DSORG (data set organization), RECFM (record format), and other parameters
Remarkably, S/360 predates UNIX by several years, so it's definitely not a novel concept. Albeit, apparently, IBM systems cost an order of magnitude more than the PDPs of the time.
Of course there are also the classic LISP machines, where afaik pretty much everything was an S-expression, but I have even less knowledge about the details of those.
You should check out PowerShell. It's based on exactly that concept. Everything is an object.
`Import-Csv` and `ConvertFrom-Json` return objects that you can manipulate using LINQ or PowerShell's `*-Object` cmdlets (`Select-Object`, `Where-Object`, `Group-Object`, etc.). `Get-Process` returns objects representing your system's processes, `Get-Service` returns objects for your system's services, and so on.
Being able to chain that in your pipeline is incredibly powerful. You can send data into and out of CSVs and various other providers in a single line. It's a huge productivity boost.
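To give a flavour (a toy example, not tied to anything in particular):

    Get-Process |
        Where-Object { $_.WorkingSet64 -gt 500MB } |
        Sort-Object WorkingSet64 -Descending |
        Select-Object Name, Id, WorkingSet64 |
        Export-Csv .\big-processes.csv -NoTypeInformation

One pipeline, no text parsing anywhere: you filter and sort on real properties and only serialize to CSV at the very end.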
I'm a bit surprised by the need for the verbose "--icsv --ocsv"... Shouldn't it be trivial to see that the input is CSV? (And in that case, the output could be CSV by default.)
> Shouldn't it be trivial to see that the input is CSV?
There is no reliable way to detect CSV files: they have no magic number, all kinds of separators are in use, and quoting rules differ widely.
It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.
I much prefer my tools not to be "90%-smart", but predictable.
I assumed GP was referring to the filename extension. Of course people put all sorts of nonsense into files and then call them ".csv", but it's not a bad heuristic to use that to guess the format as a default, and even guess that the output format should be the same, in the absence of other flags.
> I much prefer my tools not to be "90%-smart", but predictable.
I understand your preference, but please recognize that it is not universal. Some of us much prefer a super-simple tool that fails for some particular cases, while requiring special options to be completely general. Thus you can use the heuristic defaults interactively (where you'll notice the errors easily), and write scripts with the more explicit form.
The problem with this is the same problem that CSV has to solve, though: there's no escape character specified in ASCII, so you can't have a unit that contains any of those 4 characters, or else you'll break the parser.
In turn I get what you are saying, but in this case it is not a trivial problem. CSV files seem simple on the surface but there are all sorts of gotchas.
For example, there's plenty of variation between platforms/applications when it comes to just terminating a line. Are we using CR, LF, CR+LF, LF+CR, NL, RS, EOL? What do we do when the source file is produced by an app that uses one approach but doesn't care about the others (allows their occurrence)?
If those others appear in the data, would our "90%-smart" tool make the wrong determination about line termination for the whole file? Would everything just break, or would the tool churn along and wreck all the data? How long until you noticed?
By my estimation, the "90%-smart" tool would be about 30% dependable unless used only with a known source and format, meaning it wouldn't need to be smart in the first place.
> CSV files seem simple on the surface but there are all sorts of gotchas.
My point is that supporting "general CSV files" is useless. Restricting your tooling to "simple CSV files" is good. But my opinions are not very representative. I also think that it is perfectly acceptable for a shell script to fail badly when it encounters filenames with spaces.
> It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.
How can that be? Is it possible to have such an ambiguous file? I mean, if a file contains a single number on a single line it can be anything, but the interpretation is the same. Can you create a file that has different contents depending on whether it is interpreted as CSV or TSV?
> Can you create a file that has different contents depending on whether it is interpreted as CSV or TSV?
Easily!
a\tb,c
Is "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!
In practice, if the first line contains only one of the two (only tabs or only commas), that might be enough to infer the separator, but:
1. that would fail on single-column files (by misinterpreting them as multi-column if the unused separator appears)
2. that couldn't infer anything on files where both separators appear
So it would only be a 90% (or maybe 99%) solution.
If it's a single-column file, you - the user - should know it and act accordingly. Yeah, scripting usage on random input, I know, but in that case you would specify the input type. But heuristically defaulting to the most common format when in doubt (that would be CSV in the CSV/TSV case) is the way to go for interactive use. Or at least, it's what I would personally expect as a user.
Heuristics are annoying and data-dependent algo changes are dangerous.
A completely different example, but: limits in Splunk aggregations. You can run your report on small data, but when you scale it up (to real production data sizes, maybe) you suddenly get wrong numbers, and maybe results like "0 errors of type X" when the real answer is that there are >= 1 errors, because one of the aggregations used has a window-size limit that it silently applies. This stuff is dangerous.
What Splunk was doing for me would be the equivalent of an SQL join giving approximate answers when the data is too big.
The issue with the heuristic is that it can fail depending on the input, and the input can easily change in a way that kills the heuristic.
Say you run the tool on a file, it detects CSV input, and all is well. Then you update the file, and now it includes a tab character in the first line, and the heuristic detects it as TSV and fails - or the heuristic gives up, or whatever.
Sure, you can "improve the heuristic", but the data can still, always, change in a way that defeats it. You now need to either be careful with the data or _know_ that you should specify the format, without the tool telling you. Everything seems to work immediately, and then later it blows up. That's a problem akin to, e.g., bash and filenames with spaces. Everything works until someone has a space in a filename, and then you get told that you should have known to quote everything all along (the solution there would be to abolish word splitting).
To coin a pithy phrase: When a tool is easy to misuse and a user misuses it, blame the tool, not the user.
Now, if I were writing this thing I would make the logic much simpler: make it default to CSV (or whichever format is more common). Now the only way to break the "heuristic" is to give data in the wrong format. But if you use CSV, you don't have to explicitly give the format, and your data can't break the heuristic (unless it switches format, which you would know about).
Not only is it possible, as a couple of commenters have already shown, but, due to the many variants of CSV out there, it's possible to construct a CSV that has different contents depending on which dialect you tell your CSV reader to expect. I'll leave the actual construction as an exercise to the reader, but it would work along the same lines as the ambiguous TSV/CSV files you've seen here already.
If a text file has the same number of commas, tabs, or semicolons on every line, it most probably is (but, obviously, is not guaranteed to be) a CSV/TSV/SSV.
Defining every flavour of these is hardly possible with a simple command line, so I would rather let the user specify an entire configuration file for this. We probably need an entire CSV schema language.
That's already not true for quoted entries that contain the separator, which I think is a common CSV use case.
But I do agree you could have a heuristic. E.g. the filename ends in .csv and the first 1-5 lines contain a lot more commas/semicolons/tabs than you would expect in normal text.
You could still have the flag as a fallback when you need something that's completely reliable.
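A rough sketch of that kind of check (the file name is made up, and it ignores quoting entirely, so it's very much a heuristic):

    head -n 5 data.csv | awk '
        { commas += gsub(/,/, ","); tabs += gsub(/\t/, "\t") }
        END { if (tabs > commas) print "guess: TSV"; else print "guess: CSV" }'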
I like how `csvkit` does it; assumes CSV input by default, `-t` for TSV input, `-T` for TSV output. Given I run `csvformat -T` many times a week, I appreciate the brevity.
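For anyone who hasn't used it, that workflow looks roughly like this (file and column names invented):

    csvformat -T data.csv > data.tsv       # CSV in, TSV out
    csvcut -t -c name,age data.tsv         # -t: treat the input as TSV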
I added a new feature to my sqlite-utils CLI tool in June which covers similar ground: it now lets you load CSV, TSV and JSON data into an in-memory SQLite database and execute SQL queries against it: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/
Example usage:
sqlite-utils memory example.csv "select * from t order by color, shape"
# Defaults to outputting JSON, you can add
# --csv or --tsv for those formats or
# --table to output as a rendered table
I just prefer using R data frames - the convenience of a full data-science language shell for getting statistical data out of CSV. One can also make gorgeous graphs using ggplot2.