Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON (github.com/johnkerl)
273 points by jabo on Aug 25, 2021 | 66 comments



I always use sqlite3 for working with large CSV files:

    $ sqlite3
    sqlite> .mode csv
    sqlite> .import my.csv foo
    sqlite> SELECT * FROM foo WHERE name = 'bar';
It reads the header row automatically for the field names and stores all the values as strings.
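
The same import also works non-interactively; for instance, a sketch fed through a heredoc (same file and table names as above):

    $ sqlite3 :memory: <<'SQL'
    .mode csv
    .import my.csv foo
    SELECT * FROM foo WHERE name = 'bar';
    SQL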


Not bad. Here is the Windows equivalent, with the Excel-style semicolon separator:

    PS> gsudo choco install sqlite3

    PS> sqlite3
    sqlite> .mode csv
    sqlite> .separator ";" "\n"
    sqlite> .import my.csv foo
    sqlite> SELECT * FROM foo WHERE name = 'bar';
For CSVs containing up to several hundred thousand records, a native PowerShell solution is way faster and more practical:

    Get-Content my.csv | ConvertFrom-Csv | ? name -eq bar
FYI, I maintain the Miller Chocolatey package: https://community.chocolatey.org/packages/miller


Great tip. A simple utility script makes this easy to reuse and customize: https://pastebin.com/qEkimk7W

Another advantage is that the script can do the import as part of initialization and drop you into the sqlite prompt with the table ready for SQL:

    $ csv_to_table.sh sample-test.csv
It can also take a heredoc for non-interactive execution:

    $ csv_to_table.sh sample-test.csv <<-CMD
    select * from sample_test limit 10;
    CMD
    ... output ...
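
For readers who don't want to follow the link, a hypothetical minimal version of such a wrapper (the actual pastebin script surely differs; the table-name derivation here is an assumption) might look like:

    #!/bin/sh
    # csv_to_table.sh (sketch): import a CSV into an in-memory SQLite database,
    # name the table after the file, then hand the session over to stdin
    # (an interactive prompt, or a heredoc for non-interactive use).
    csv="$1"
    table="$(basename "$csv" .csv | tr '.-' '__')"
    {
        printf '.mode csv\n'
        printf '.import %s %s\n' "$csv" "$table"
        cat    # forward the caller's SQL (tty or heredoc)
    } | sqlite3 -interactive :memory: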


This Stack Exchange answer makes that into a fairly simple bash alias: https://unix.stackexchange.com/questions/424555/how-to-inser...


I do the same. I wrote [0] a little wrapper around this workflow that adds a few niceties: 1) it normalizes column names and removes whitespace, 2) it supports importing multiple CSV files without importing the header line as a record.

[0]: https://github.com/psanford/csv2sqlite


holy moly, that's rad. definitely adding this to my toolbox, thanks!


DuckDB is another good option.
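
For example, DuckDB can query a CSV directly with its CSV auto-detection (a sketch; the file name is assumed):

    echo "SELECT * FROM read_csv_auto('my.csv') WHERE name = 'bar';" | duckdb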


[flagged]


If there is anything ugly here, it's your communication.

Aside from the fact that OP's example can run in a non-interactive mode, it makes absolutely no difference to the discussion.


You can't understand that there are people who know SQL and not R?


I don't think the comment refers to people, just to better ways of doing things. People learn.


In my experience people rarely learn anything after you insult them and call their work ugly.

If AP had a "better way of doing things" maybe they should have actually given us an example of what it looks like.

Right now, all I think about R is that snobby, gatekeeping assholes use it to do things I can already do easily in SQL.


If you have Windows, you have a tool that can do this already installed: PowerShell. (If you don't have Windows, PowerShell is free and open source [MIT license]. One of the first things I install on a new Linux or macOS box lately.)

The first example in PowerShell:

> Import-Csv example.csv | Sort-Object Color, Shape

To filter, use `Where-Object`:

> Import-Csv example.csv | Where-Object { $_.Color -eq 'purple' } | Sort-Object Color, Shape

Or using aliases:

> ipcsv example.csv | ? { $_.Color -eq 'purple' } | sort Color, Shape

JSON uses the same syntax -- just replace `Import-Csv` with `Get-Content` and `ConvertFrom-Json`:

> Get-Content example.json | ConvertFrom-Json | Select-Object Color -Unique

> purple

> red

> yellow

There's also `Group-Object` for aggregation.


Some other useful tools, for CSV files in particular:

csvkit, a set of command line tools for manipulating csvs: https://csvkit.readthedocs.io/en/latest/

VisiData, a nice terminal-based CSV explorer: https://www.visidata.org/


VisiData is pretty mind-blowing... at least, that's how I felt when I saw the demo the first time: https://www.youtube.com/watch?v=N1CBDTgGtOU


I use Miller a lot at work and it has been a godsend. People tend to underestimate tools like this, but I'm able to answer a lot of questions about our data that would take other people longer, or that they don't have the capacity to check at all. It's great as a first step, and if anything more complex is needed we usually spin up a Python script. So it's very much in the philosophy of small composable tools, but for records.


Not exactly the same, but we wrote a library that easily loads any delimited file and finds the header (even if it's not the first row). It also loads JSON, Parquet, and Avro into a dataframe. Not a CLI exactly, but pretty easy:

https://github.com/capitalone/dataprofiler

To install:

    pip install dataprofiler[ml] --user

How it works:

    from dataprofiler import Data

    csv_data = Data('your_file.csv')  # Load: delimited, JSON, Parquet, Avro
    csv_data.data.head(10)  # Get head
    csv_data.data.sort_values(by='name', inplace=True)  # Sort

Anyway, Miller CLI looks pretty interesting. I'm not sure how its header detection works, especially when the header isn't the first row (which is often the case).


Thanks. This looks very cool.


Is this (supposedly) better than BurntSushi's toolset (e.g. xsv [1]), or just different?

[1] https://github.com/BurntSushi/xsv


I love xsv, but it's been 3 years since the last release and the PRs are accumulating... Miller seems much more active.


The comparison between their ongoing C version [1] and the upcoming Go version [2] is nice to see. The README talks a bit more about the performance comparison and the C version's strengths, but still.

[1] https://github.com/johnkerl/miller/tree/main/c

[2] https://github.com/johnkerl/miller/tree/main/go


See also my own foray into this space: https://tkatchev.bitbucket.io/

You might find it useful.


Thanks for tab. I have used it a bit in the pipelines of some of my projects. Like jq, I think tab is a good example of a DSL that lets you get things done tersely and quickly. Same as Perl, I suppose, but cleaner and more functional. Raku is also great, but it adds big dependencies, whereas tab is a single binary, which is always nice.

If you are taking feature requests, string interpolation or a sprintf function would make report-generation kinds of tasks easier for me.


A single binary dependency may be a better fit for one project as you outline, but I would make the wider case for a common core language that provides a DSL toolkit with a recursive descent style Grammar capability. This helps to build critical mass across multiple DSLs to maintain and improve a secure code base. In turn, DSL authors get a more productive development environment. Raku embeds Grammar support for DSLs and will soon have AST introspection for "Slang" authors to reach in and build elements directly...


I'm still looking forward to a single binary for each of the following:

* NQP

* Rakudo

* Rakudo programs

Building on https://yakshavingcream.blogspot.com/2019/08/summer-in-revie...


See LNAV (https://lnav.org) for a "mini-ETL" CLI powertool with embedded SQLite; it's fantastic.


How does this compare to JQ?

Aside from the fact that it appears to support more formats (CSV [which has partial jq support] and TSV), of course.

I find it very helpful when people compare their new toy or service with what is already pretty well known.


Miller is not new (at least 6 years old is what I can glean from the GitHub page).

It is somewhat similar to JQ, but specifically for record based formats and not documents.

The example page [0] has quite a few nice examples.

A recent snippet from my own usage, looking for duplicate ids in a large-ish csv.gz:

    mlr uniq -c -g id then filter '$count > 1'
    mlr --ijson --ocsv filter '$success' then stats1 -a mean,min,max,stddev -f duration
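
Fed from the gzipped file, the duplicate-id check end to end might look like (file name assumed):

    gunzip -c data.csv.gz | mlr --csv uniq -c -g id then filter '$count > 1'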

[0]: https://miller.readthedocs.io/en/latest/data-examples.html


I don't think sed and awk should be on this list. Miller isn't really able to transform text into something else, is it?

Even the tutorial leans on sed for cleansing ahead of Miller. https://www.ict4g.net/adolfo/notes/data-analysis/miller-quic...



Just yesterday I was opining in IRC about what it would be like to have an operating system built from the ground up to support smarter data formats than just plain text.

The concept of everything as a file has awesome benefits, I believe, but that doesn't mean files need to be unstructured text.

There's a core problem: with text languages (like awk) or structured-regex tools (sed, etc.), you end up having to both parse AND manipulate data, which is no fun and prone to errors.

Abstracting away all of that into codecs that all of coreutils could speak would be very cool to me.

The second issue is structured data vs. unstructured text. CSVs or other table-based formats make sense sometimes, and sometimes you want to be able to query things more easily; JSON provides that. What the synthesis of the two is, I don't know... but maybe this is closer to heaven.

I'd like to _minimize_ the amount of parsing or text manipulation done in any program and be able to focus on extracting and inserting the data I need, like a rosetta stone that just handles everything for me while I work. I want to be able to do:

awk "{print $schoolName}" or awk "{print $4}" and it just works - json, text, CSV.


Maybe the most notable system that had "files as structured data" is the IBM S/360 lineage of mainframes.

> In the context of IBM mainframe computers in the S/360 line, a data set (IBM preferred) or dataset is a computer file having a record organization.

> Data sets are not unstructured streams of bytes, but rather are organized in various logical record[3] and block structures determined by the DSORG (data set organization), RECFM (record format), and other parameters

https://en.wikipedia.org/wiki/Data_set_(IBM_mainframe)

Remarkably, S/360 predates UNIX by several years, so it's definitely not a novel concept. Albeit, apparently, IBM systems cost an order of magnitude more than the PDPs of the time.

Of course there are also the classic LISP machines, where AFAIK pretty much everything was an S-expression, but I have even less knowledge about the details of those.


You should check out PowerShell. It's based on exactly that concept. Everything is an object.

`Import-Csv` and `ConvertFrom-Json` return objects that you can manipulate using LINQ or PowerShell's `*-Object` cmdlets (`Select-Object`, `Where-Object`, `Group-Object`, etc.). `Get-Process` returns objects representing your system's processes, `Get-Service` returns objects for your system's services, and so on.

Being able to chain that in your pipeline is incredibly powerful. You can send data into and out of CSVs and various other providers in a single line. It's a huge productivity boost.


so, powershell?


Sounds like it. Another option might be nushell. https://www.nushell.sh/


You should take a look at the late TempleOS. It's not a serious contender but has a lot of interesting ideas in that direction


He is dead now. RIP


Similar, but using the ACTUAL awk, sed, join, sort tools you already have and know about: https://github.com/dkogan/vnlog/


I'm a bit surprised by the need for the verbose "--icsv --ocsv"... Shouldn't it be trivial to see that the input is CSV? (And in that case, the output could be CSV by default.)


> Shouldn't it be trivial to see that the input is csv?

There is no reliable way to infer that a file is CSV: CSV files have no magic number, all kinds of separators are in use, and quoting rules differ widely.

It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.

I much prefer my tools not to be "90%-smart", but predictable.


I assumed GP was referring to the filename extension. Of course people put all sorts of nonsense into files and then call them ".csv", but it's not a bad heuristic to use that to guess the format as a default, and even guess that the output format should be the same, in the absence of other flags.


> I much prefer my tools not to be "90%-smart", but predictable.

I understand your preference, but please recognize that it is not universal. Some of us much prefer a super-simple tool that fails for some particular cases, while requiring special options to be completely general. Thus you can use the heuristic defaults interactively (where you'll notice the errors easily), and write scripts with the more explicit form.


Wouldn't it be wonderful if we actually used ASCII as it was designed?

    Oct   Dec   Hex   Char
    ----------------------------------------
    034   28    1C    FS  (file separator)
    035   29    1D    GS  (group separator)
    036   30    1E    RS  (record separator)
    037   31    1F    US  (unit separator)
https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...
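
For what it's worth, awk can already consume these if you set the separators explicitly (octal escapes; the input file name is assumed):

    # ASCII RS (octal 036) as the record separator, US (037) as the field
    # separator; print the first field of every record.
    awk 'BEGIN { RS = "\036"; FS = "\037" } { print $1 }' data.asv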


The problem with this is the same problem that CSV has to solve though - there's no escape character specified in ASCII so you can't have a unit that contains any of these 4 characters or else you'll break the parser.


In turn I get what you are saying, but in this case it is not a trivial problem. CSV files seem simple on the surface but there are all sorts of gotchas.

For example, there's plenty of variation between platforms/applications when it comes to just terminating a line. Are we using CR, LF, CR+LF, LF+CR, NL, RS, EOL? What do we do when the source file is produced by an app that uses one approach but doesn't care about the others (allows their occurrence)?

If those others should appear in the data, would our "90%-smart" tool make the wrong determination on line termination for the whole file? Would everything just break, or would the tool churn along and wreck all the data? How long until you noticed?

By my estimation, the "90%-smart" tool would be about 30% dependable unless used only with a known source and format, meaning it wouldn't need to be smart in the first place.


> CSV files seem simple on the surface but there are all sorts of gotchas.

My point is that supporting "general CSV files" is useless. Restricting your tooling to "simple CSV files" is good. But my opinions are not very representative. I also think that it is perfectly acceptable for a shell script to fail badly when it encounters filenames with spaces.


> It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.

How can that be? Is it possible to have such an ambiguous file? I mean, if a file contains a single number on a single line it can be anything, but the interpretation is the same. Can you create a file that has different contents depending on whether it is interpreted as CSV or TSV?


>Can you create a file that has different contents depending on whether it is interpreted as csv or tsv?

Easily!

    a\tb,c
Is "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!

In practice, if the first line contains only a tab or a comma it might be enough to infer that as the separator, but:

1. that would fail on single-column files (by misinterpreting them as multi-column if the unused separator appears)

2. that couldn't infer anything on files where both separators appear

So it would only be a 90% (or maybe 99%) solution.
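
You can watch Miller make both interpretations of the same bytes (a small sketch; the sample data is made up):

    # Parsed as CSV: two fields per record, "a<TAB>b" and "c".
    printf 'a\tb,c\n1\t2,3\n' | mlr --icsv --ojson cat
    # Parsed as TSV: two fields per record, "a" and "b,c".
    printf 'a\tb,c\n1\t2,3\n' | mlr --itsv --ojson cat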


If it's a single-column file, you - the user - should know it and act accordingly. Yeah, scripting usage on random input, I know, but in that case you would specify the input type. For interactive use, though, heuristically defaulting to the most common format in case of doubt (CSV, in the CSV/TSV case) is the way to go. Or at least, it's what I would personally expect as a user.


Heuristics are annoying and data-dependent algo changes are dangerous.

Completely different example, but consider limits in Splunk aggregations: you can run your report on small data, but when you scale it up (to real production data sizes, say), you suddenly get wrong numbers - maybe results like "0 errors of type X" when the real answer is that there are >= 1 errors - because one of the aggregations has a window-size limit that it silently applies. This stuff is dangerous.

What Splunk was doing for me would be the equivalent of an SQL join giving approximate answers when the data is too big.


I can see that argument, but I kinda don't agree?

The issue with the heuristic is that it can fail depending on the input, and the input can easily change in a way that kills the heuristic.

Say you run the tool on a file, and it detects csv input and all is well. Then you update the file, and now it includes a tab character in the first line and the heuristic detects it as a tsv and now it fails - or the heuristic now gives up, or whatever.

Sure you can "improve the heuristic", but you can still, always, have the data change in a way that it defeats it. You now need to either be careful with the data or _know_ that you should specify the format, without the tool telling you. Everything seems to work immediately, and then later it blows up. That's a problem akin to e.g. bash and filenames with spaces. Everything works, until someone has a space in a filename, and then you get told that you should have known to quote everything all along (the solution there, would be to abolish word splitting).

To coin a pithy phrase: When a tool is easy to misuse and a user misuses it, blame the tool, not the user.

Now, if I were writing this thing I would make the logic much simpler: Make it default to csv (or whichever format is more common). Now the way to break the "heuristic" is to give data in the wrong format. But if you use csv, you don't have to explicitly give the format and your data can't break the heuristic (unless it switches format, which you would know about).


Not only is it possible, as a couple of commenters have already shown, but, due to the many variants of CSV out there, it's possible to construct a CSV that has different contents depending on which dialect you tell your CSV reader to expect. I'll leave the actual construction as an exercise to the reader, but, it would work along the same lines as the ambiguous TSV/CSV files you've seen here already.


Sure. It is typical to represent currency amounts as

12,000

If it isn't quoted, then a CSV reader will treat the comma as a separator, while a TSV reader won't.

There are an amazing number of similar examples. I went for a real use case rather than a theoretical possibility.


If a text file has the same number of commas, tabs or semicolons on every line it most probably is (but, obviously, is not guaranteed to be) a CSV/TSV/SSV.

Defining every flavour of these is hardly possible with a simple command line, so I would rather let the user specify an entire configuration file for this. We probably need an entire CSV schema language.


That's already not true for quoted entries that contain the separator, which I think is a common CSV use case.

But I do agree you could have a heuristic. E.g. ends in .csv and contains a lot more commas/semicolons/tabs than you would expect in normal text in the first 1-5 lines.

You could still have the flag as a fallback when you need something that's completely reliable.


I like how `csvkit` does it: it assumes CSV input by default, with `-t` for TSV input and `-T` for TSV output. Given I run `csvformat -T` many times a week, I appreciate the brevity.
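
A typical round trip with those flags (file names assumed):

    csvformat -T data.csv > data.tsv    # CSV in, TSV out
    csvformat -t data.tsv > data2.csv   # TSV in, CSV out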


> Like Awk, sed, cut, join, and sort […]

So basically Perl? :)


I added a new feature to my sqlite-utils CLI tool in June which covers similar ground: it now lets you load CSV, TSV and JSON data into an in-memory SQLite database and execute SQL queries against it: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/

Example usage:

    sqlite-utils memory example.csv "select * from t order by color, shape"
    # Defaults to outputting JSON, you can add
    # --csv or --tsv for those formats or
    # --table to output as a rendered table 
More docs here: https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...


There's also rq (record query) [1], which supports CSV and JSON but not TSV. It's written in Rust.

[1] https://github.com/dflemstr/rq


I just prefer using R data-frames - the convenience of a full data-science language shell for getting statistical data out of CSV. One can also make gorgeous graphs using ggplot2.


Nice. Any reason why the installation section is not prominent and at the top of the README?

It's even in a non-intuitive place in the documentation.


I don't know about Miller's portability, but RecordStream (https://github.com/benbernard/RecordStream) is my go-to Swiss Army knife.


Got a csv or json you need to eff-with from the command line? It's Miller time, baby!


I recently came across jq, which I use for JSON parsing: https://stedolan.github.io/jq/


For Windows users, NimbleText is a great little utility for working with tabular data:

https://nimbletext.com/


A bit OT, but there really needs to be a paradigm shift toward a new OS interface with less coupling and more flexibility than piping.



