Doing a database join with CSV files (johndcook.com)
357 points by chmaynard on Dec 31, 2019 | 111 comments


  sqlite> .mode csv
  sqlite> .header on
  sqlite> .import weight.csv weight
  sqlite> .import person.csv person
  sqlite> select * from person, weight where person.ID = weight.ID;
  ID,sex,ID,weight
  123,M,123,200
  789,F,789,155
  sqlite>


PostgreSQL can do it on live data without importing anything. Set up the "file_fdw" extension:

  CREATE EXTENSION file_fdw;
  CREATE SERVER fdw FOREIGN DATA WRAPPER file_fdw;
Declare tables:

  CREATE FOREIGN TABLE person (id int, sex text)
    SERVER fdw
    OPTIONS (filename '/tmp/person.csv', format 'csv');
  CREATE FOREIGN TABLE weight (id int, weight int)
    SERVER fdw
    OPTIONS (filename '/tmp/weight.csv', format 'csv');
Now:

  SELECT * FROM person JOIN weight ON weight.id = person.id;


In R this is rather straightforward, and you can also achieve it without additional libraries:

  weight <- read.csv("weight.csv")
  person <- read.csv("person.csv")
  merge(weight, person, all=TRUE)
Of course, nowadays you would use data.table, but still the merging logic would be exactly the same.


That loads all the data into memory, though. Fine if you're joining all of it, but with the Postgres example you can throw the whole SQL language at the data, and it will be streamed and pipelined efficiently just like with normal Postgres tables.


I never got deep into R, as we had Python or R as options in school for data projects, so forgive me: where is the database join in this scenario?


Original article doesn't say anything about "database join". It's about joining two datasets by some common ID.

R in this case fits the bill and even allows for some relational algebra here (e.g. INNER JOIN would be merge(X, Y, all=FALSE), LEFT OUTER JOIN: merge(X, Y, all.x=TRUE), etc...)


Merge joins the two datasets on identically named columns; all=TRUE makes it a full outer join.


This isn't "without importing anything". This is "importing all the data again every time you run a query".


No, "import" means "create state from data". It's not creating any persistent state at all.

But it's an important nuance for another reason: you can operate on the original files in real time. You can update the underlying files and your queries will reflect the changes.

The downside is that, since the data is streamed at query time, it's less efficient and doesn't allow indexes to be created.


That's reasonable.


I can't speak for XSV, but when I had to do similar analysis of logfiles in the range of hundreds of thousands of records, SQLite would consistently seize up on my machine and make it impossible to do analysis. Had I known about XSV I might have tried it, but GNU join saved my bacon that week and would do my analysis in minutes.


That's why you can `EXPLAIN` your query, and maybe even add indices, to speed things up!

Plus in the last few years SQLite has gotten a lot more powerful, adding lots of JSON support and improving index usage. I'd avoid working with lots of data as flat ASCII text, because of all the I/O and wasted cycles spent reading the data, inserting it into some data structure, and then writing it back out as flat ASCII text, but it can be super debuggable.
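
Picking up the sqlite3 session from the top comment, a rough sketch of what that might look like (the index name here is just illustrative):

  sqlite> CREATE INDEX weight_id ON weight(ID);
  sqlite> EXPLAIN QUERY PLAN SELECT * FROM person JOIN weight ON person.ID = weight.ID;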


> GNU Join

The join utility is actually a part of POSIX[1], so every UNIX should have one. Here's one from OpenBSD[2], for example. The GNU version probably has more flags though.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/j...

[2] https://man.openbsd.org/join.1


Plus POSIX join requires pre-sorted inputs.
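
For example, a minimal sketch assuming headerless, plain comma-separated files (no quoted fields) joined on their first column:

  sort -t, -k1,1 a.csv > a.sorted.csv
  sort -t, -k1,1 b.csv > b.sorted.csv
  join -t, a.sorted.csv b.sorted.csv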

BTW, there's never a bad time to mention http://johnkerl.org/miller/doc/ ...


Yeah my colleague went down the rabbit hole and swore never again. Good luck with carriage return bugs!


Did you create indices on the columns you wanted to join?


I was trying to join two 300M-line CSV files using SQLite and Postgres and it took days to merge them on a c4.xlarge.

Later, I tried Miller and it took less than 4-6 hours.

It also takes some time to tune Postgres to make it faster for this particular task.


Seems like you were doing something naively. Probably would have been much faster if you'd written a simple conversion script to do batch inserts.

Perhaps something pathological like committing (or flushing to disk) after every row?


4-6 hours still seems like a long time for this, did you have sorted indices on the join columns? How high of cardinality (distinct counts) did the join columns have? Traditional RDBMS are pretty fast at joining when you have the right indices in place, especially when the cardinality of the join columns isn't very high.

And how much of that time was just getting the data into the tables? There are fast and slow ways to do that, too...


IIRC, it took me 4-7 hours to load 600GB of uncompressed data into PostgreSQL after disabling WAL and tuning some config variables. I didn't have time to test all possible speed improvements.

After the import, it took an additional 3-6 hours to create indexes for those tables.

I think the main problem was that I had text indices. I'm not an expert in RDBMS and used a simple join that was taking ages to start producing the actual data. There are definitely a lot of ways to tune such queries and PostgreSQL configs, but I wanted a simple and universal solution.


On average each line was 1GB of data!? Can you give a rough description of the content of the two files?


Not GP and nothing to do with them, but I have an example of a CSV format which yields absurdly long lines.

We have a CSV for timeseries data, where for some reason someone decided to force each line to represent one data point. However, for some cases, one data point may contain around 30 different statistics for 4 million different IPs, which get represented as 120 million columns in the CSV (think a CSV header like 'timestamp,10.0.0.1-Throughput,10.0.0.1-DroppedBytes,[...],10.215.188.251-Throughput,[...]'). With large numbers represented as text in a CSV, this can sometimes reach more than 1 GB per line.


600G / 300M = 2K


Woof, yeah, bad arithmetic by me. Thanks!


When I've had to do similar, the easiest and fastest way has always been: install Spark -> create an external Spark SQL table from the CSVs -> use Spark SQL to join and aggregate/write. If you haven't used Spark before it sounds complicated, but it's actually really quick and easy.

In general the operation on my laptop can get up to 200 MB/s; it's basically I/O-limited by the SSD of your machine.


Would be nice if you could write a small article about that.


Similarly, I've used PrestoSQL for this. It's similar to Spark and Hive.


This is a perfect candidate for XSV. You might want to try it. I'd suggest doing one run after creating an index using XSV and a second one without.

I feel you will be impressed with the performance.
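
From memory, the commands would look something like this (file and column names are placeholders, and I'm going from memory on the exact CLI, so check `xsv join --help`):

  xsv index big1.csv
  xsv index big2.csv
  xsv join ID big1.csv ID big2.csv > joined.csv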


This would take seconds to minutes with BigQuery or Athena.


It would still take a long time to actually send all that data to BigQuery


Since the OP noted that they were trying to accomplish this task on a c4.xlarge machine, the data is already "in the cloud" and so transfer should be blazingly fast.

Using the zipped CSV at [1] as a size estimate, I'd ballpark 300 million CSV rows as somewhere between 6.7-10.6 GB:

    (300,000,000 / 500,000) * 11.10 MB ≈ 6.7 GB
    (300,000,000 / 500,000) * 17.68 MB ≈ 10.6 GB
That could take about a minute to transfer. (Obviously if the rows contain a ton of data, these size estimates could be off, in an unbounded way.)

[1] http://eforexcel.com/wp/downloads-18-sample-csv-files-data-s...


What is Miller?


In PostgreSQL we could do that in parallel:

https://www.postgresql.org/message-id/CA+hUKGKzn3mjCAp=TDrji...


You might want to save the query result to a file too:

    .once out.csv
    select * from person, weight where person.ID = weight.ID;
    .mode column


This converts everything to "text" type though and makes many queries jacked up.


Cool - you could create a bash alias called xsv for that.


You should probably create an index on the tables first and save the result into a temporary table.


Not that I want to be the "well actually" guy... but MS has a much-underrated tool called Log Parser that's been around for years and lets you do this kind of thing and more.

If you're a Windows admin and haven't bumped into this utility then it's well worth a look.

edit: forgot to mention that it can also be scripted from VBScript and PowerShell for added fun.

https://www.microsoft.com/en-gb/download/details.aspx?id=246...


Seconding for log parser.

There's even a UI: https://techcommunity.microsoft.com/t5/exchange-team-blog/in...


The `join` command does a similar job and is included by default on at least Mac OS X (10.14) and Ubuntu 16.04.

I think it's much more common than that implies, but I haven't looked beyond the two machines I had immediate access to.

The interface is clumsier in some ways, but it's already there, which is often a win when writing scripts.


xsv author here.

There is very little overlap between what xsv does and what standard Unix tools like `join` do. Chances are, if you're using xsv for something like this, then you probably can't correctly use `join` to do it because `join` does not understand the CSV format.

If your CSV data happen to fall into the subset of the CSV format that does not include escaped field separators (or record separators), then a tool like `join` could work. Notably, this might include TSV (tab separated files) or files that use the ASCII field/record separators (although I have literally never seen such a file in the wild). But if it's plain old comma separated values, then using a tool like `join` is perilous.
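
To make that concrete, a line like the following (made-up data) is a single three-field CSV record, but anything that naively splits on commas will see four fields:

  123,"Doe, Jane",200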

I didn't write xsv out of ignorance of standard line oriented tools. I wrote it specifically to target problems that cannot be solved by standard line oriented tools.

You might also argue that data should not be formatted in such a way, and philosophically, I don't necessarily disagree with you. But xsv is not a philosophical tool. It is a practical tool to deal with the actual data you have.


In case folks don't know, burntsushi is also the author of ripgrep (rg), written in Rust, and possibly the fastest grep in the world at the moment.

rg has all but replaced grep for me.


>burntsushi is also the author of ripgrep

They are also the source of much highly practical domain-specific knowledge and advice on how to write Rust code that does, e.g., efficient text file reading, streaming, state machines, etc.

Many thanks for these contributions.


And that advice applies further than just Rust, too. Highly recommend checking their blog and work out!


@burntsushi ... Sir .. need to say this.. many thanks for your tools! They make my life so clean and so manageable. Data wrangling on the command line with your tools is such a great experience (minimalistic and efficient).

Is there anything for `json` files that you would recommend? `jq` is awesome but just wondering.


Thanks! And no, I just use `jq`. I do often have to consult its man page for its DSL syntax, but probably because I don't use it frequently enough.


Thanks for the info. That is definitely good to have spelled out.

I don't believe I ever said that xsv shouldn't exist, or has no legitimate purpose, or was written out of ignorance - just that join is useful for similar tasks and worth knowing about, particularly because it's usually available without an install or compile.

You've obviously done a lot more CSV wrangling than I have, and the info here on what exactly xsv does that join cannot is helpful.

Thanks for sharing!


Xsv is excellent. It’s included in this list of “awesome csv” tools I’ve been collating, because I am inordinately fond of csv: https://github.com/secretGeek/AwesomeCSV/blob/master/README....


The join command has been around for over 30 years!

See p. 7 of this PDF (which contains SunOS 4.1.2 manual pages): http://www.bitsavers.org/pdf/sun/sunos/4.1.2/800-6641-10_Ref...

The short description it gives for the command is "relational database operator".

(The "Last change: 4 March 1988" at the bottom of the page refers to the last change of the intro(1) manual page, not the whole book.)


The problem most of the time is correctly parsing the CSV files.


This command is even covered in LPIC-1.


If your goal is to join large CSV files using a local program, the ideal tool is not SQLite but DuckDB:

https://www.duckdb.org/docs/current/sql/copy.html

The reason why DuckDB is a better fit for this job is because DuckDB is a column store and has a block-oriented vectorized execution engine. This approach is orders of magnitude faster when you’re doing batch operations with millions of rows at a time.

In contrast, SQLite would be orders of magnitude faster than DuckDB when you're operating on one row at a time.


I've used SQLite quite a few times, but never heard of DuckDB. Can anybody provide some more information about it?


Main author of DuckDB here, I did not expect to see this mentioned here. DuckDB is a relational DBMS geared towards efficiently handling large analytical-style workloads locally. It is similar to SQLite in the sense that it operates locally on your machine, is easy to run and install and has zero dependencies. However, DuckDB uses modern processing paradigms (vectorized processing, columnar storage) that make it much faster when processing large amounts of data.

It's still in an early stage currently, however, most of the functionality is there (full SQL support, permanent storage, ACID properties). Feel free to give it a try if you are interested. DuckDB has Python and R bindings, and a shell based off of the sqlite3 shell. You can find installation instructions here: https://www.duckdb.org/docs/current/tutorials/installation.h...


I wonder what niche these tools fill. Maybe when you want to do some ad-hoc analysis and want every step in your shell history?

Otherwise it seems more flexible to just fire up a Python interpreter and do it in like 3 lines of pandas (or with SQLite and .import like another commenter mentioned).


I’ve never had to use joins, but I’ve used shell scripts in CI environments where I can’t install stuff. For example, I had a Jekyll site generated from a project in some Linux ruby container. I couldn’t install python (or this xsv tool) but still wanted to summarize some of the data for display in one of the pages. While I would certainly rather use python, usinf shell was easier than convincing the admin of that server to install python into the ruby image, or trying to learn ruby and however they do simple data counts.

This tool is weird because if I’m going to download some tool that’s not in the distro, I’d rather use SQLite or try to figure it out with awk.


The shell is a very powerful REPL for incremental development, so you can often do things more rapidly by looking with `xsv headers`, then joining with `xsv join | xsv select | head | xsv table` than using python or sqlite.

I definitely used to use `sqlite` for every CSV before and I've shifted to using `xsv`. If I want to do any heavy lifting I'm going to pop it into PostgreSQL anyway.

A crucial feature is that it's easy to pop some `xsv` pipeline in the middle of your `for i in ...` or `find . -print0 | xargs -0`.
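
For instance, a quick (hypothetical) loop that prints a record count for each file in a directory:

  for f in data/*.csv; do echo "$f: $(xsv count "$f")"; done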


For me it's ad-hoc analysis on large CSV files. Large meaning well beyond what Excel would be capable of, often larger than fits into memory on my local machine (10s of GiB).

Sometimes I also use xsv to just do a step of the analysis and dive deeper on some subset using pandas.

In my experience both SQLite and Pandas aren't as fast for large files, so they are not really good options.

Pandas is especially bad because it uses a column-oriented data structure internally, so reading from or writing to CSV is incredibly slow in Pandas. If you can use Parquet that's not a problem, but unfortunately Parquet is not nearly as ubiquitous as CSV :(


If Pandas is slow, then you can use Spark. For such big files a laptop is not an option anyway. SQLite can be fast if you index your data (but I've worked with files < 10G). Nowadays I just upload the CSV to some cloud database and work with the data there.


> For such big files a laptop is not an option anyway

Too big for Excel is not big data, and my laptop can load this 10G into RAM (not that it necessarily needs all of it), so why not, if the data is there and the laptop is on your lap?


I had to do this type of work frequently at an old job. We'd built short-run software systems for clients.

Often this resulted in one-off data loads or reporting metrics. It was almost always easier just to knock something out with command line tools than it was to write, test, and deploy code. The deployment process itself could take longer than actually processing the file.


The same niche grep, sed, cut, sort, etc. fill, but specifically for CSV; using them together (xsv with other Unix tools) can get you very far very quickly on some tasks.


+1 for Python pandas. Amazing tool to manipulate data and CSV very easily. Lots of options!


For Windows, Microsoft offers a CSV ODBC driver that can make the files look like any other relational database.

https://support.microsoft.com/en-us/help/850320/creating-an-...


Oracle has a feature called external tables that lets you map regular files to the database and run SQL against them. Works with csv or any format that can be loaded to the database with the normal database loading utilities. This is very useful for getting transformation and loading done in one step!


MySQL has a barebones CSV storage engine: https://dev.mysql.com/doc/refman/8.0/en/se-csv-limitations.h...

More usefully, MySQL has LOAD DATA INFILE facilities for bulk loading of flat files.
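
A rough sketch of the latter, assuming a pre-created `person` table and a header row to skip:

  LOAD DATA INFILE '/tmp/person.csv'
  INTO TABLE person
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  IGNORE 1 LINES;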


The feature actually uses the syntax from the old SQLLDR tool, and can also specify fields in absolute columnar position as well as defining delimiters (reminiscent of gawk). There are also conditionals (when), rounding rules, and other transformations.


A fantastic feature - I've used it regularly in the past to load large log files into the database with incredible performance.


Postgres also has this feature (via file_fdw)


I can't speak to their odbc drivers, but after trying to use their oledb drivers to query excel spreadsheets:

Abandon all hope, all ye who enter here.


I have used the ODBC drivers for CSV previously. I agree with your assessment, never again. They are unforgiving in what they accept and not a robust solution at all. I spent months cleaning up edge cases the driver should have been able to handle.


The built-in tooling for CSV in PowerShell is excellent (provided you don't care about details like whether numbers are quoted or not) - I've been using that for CSV-related ETL tasks.


It seems most systems that support multiple database drivers include, or allow you to build, drivers for delimited text: DBI, JDBC...


`xsv` can do far more than just join and select, such as searching, filtering, partitioning, ...

For me it's the first choice when I find myself with a CSV file to work with. I've also encountered situations where just getting some raw data from a (slow) database and "querying" it with xsv ended up being the fastest option to get the results I wanted.


Also along these lines is tsv-utils:

https://github.com/eBay/tsv-utils/blob/master/README.md

There are benchmark results comparing tsv-utils to a variety of similar tools including xsv.


It's interesting that there doesn't seem to be a command-line tool out there that lets you do this with LevelDB. LevelDB is about an order of magnitude faster at importing a lot of data than SQLite or Postgres is (although I don't know if it's faster than the Postgres CSV foreign data wrapper for the first query, it sure will be for the second) and it can handle files a lot bigger than your RAM.

It's interesting to note that around 1990 The Mark Williams Company wrote a fairly full-fledged database system using, basically, CSV files. It was called "/rdb". You wrote your queries as shell scripts, but using a set of /rdb utilities that handled CSV files with labeled columns, and they had a screen-based data-entry UI based on vi. I wouldn't want to use it instead of a database --- you had to write your query plan in the shell script, rather than a high-level SQL-like query, because the authors didn't really understand SQL --- but it's interesting that the approach is still useful in 2019.


I was expecting the answer to involve SQLite and .import and am happy to have learned about this cool new tool.


I like the idea of using SQLite, and unlike LINQ it's available from any programming language that has bindings to SQLite.


This is using SQLite behind the scenes, so it's pretty much what you expected.


On the contrary, xsv doesn't use SQLite at all


Ah, sorry, I mixed it up with http://harelba.github.io/q/


Shout out to Perl DBD::CSV. Gives you full SQL query functionality.


Hey, just wanted to chime in with another tool, OctoSQL, which I'm a co-author of.

It allows you to query and join data from multiple datasources simultaneously, which may include CSV files.

Currently available datasources are SQL databases, Redis, JSON and CSV (more coming...).

[1]: https://github.com/cube2222/octosql


There are many ways to do this. A valid reason for wanting to is that database imports take time, and sometimes you just want to get to the joined data set without having to wait for that.

A few years ago, I was trying to turn OpenStreetMap dumps into JSON documents. OSM ships as a huge bzipped XML dump of what is essentially 3 tables that need to be joined to do anything productive. One of those tables contains a few billion nodes. Importing that into PostgreSQL takes many hours and takes up disk space too. Bzipped, this stuff was only around 35GB, but it unpacks to >1TB. So, I wrote a few simple tools that processed the XML using regular expressions into gzipped files, sorted those on id, and then joined files on id by simply scanning through multiple of them. Probably far from optimal, but it got me results quickly enough, without taking ages or consuming lots of disk.


Any info on whether it does this in a streaming manner? For small files there are a plethora of solutions, but it gets more tricky / interesting when you can't assume both files fit in memory and (ideally) you want to do it in a single pass because I/O itself is a bottleneck.


It's almost done in a streaming fashion. `xsv join` uses memory proportional to the number of records in one of the inputs and the size of the join columns in one of the inputs. Namely, it creates an in-memory map from the join column's values to the corresponding record index. Combined with another in-memory index that maps record index to the corresponding record's byte offset, it can jump to the corresponding record in constant time while streaming over the other input.

There are better ways to do this that don't involve using this much memory, but this struck a good balance between memory usage and implementation complexity. I may improve on this some day.

So basically, as long as you have enough memory to store the entire join columns along with `8 * len(one_input)` bytes for the record index, then you should be good.


Great! It may address some particular cases that I don't currently have a good tool for, then. Thanks!


If the files are sorted, merge in Python (`heapq.merge` or `sorted` on chunks plus `itertools.groupby`). If they're not sorted, a single pass is mostly out of the question — you need out-of-core sorting. GNU sort with intermediate compression of temporary/intermediate files is hard to beat there for performance, but if your join keys are more complicated than can be specified in GNU sort you need a more involved pipeline. If raw IO is actually the bottleneck, that's usually a sign you need better compression — look into Zstandard and lz4, especially for intermediate files. If IO is still the bottleneck, consider possible sharding keys (Python `hash` plus modulo works well, or CRC32 etc.) that would let you divide the problem up among a large number of workers. Everything is easier if you get the data into S3 storage early.
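
A rough sketch of the GNU sort + join path with compressed temporaries (assumes GNU coreutils and zstd are installed, headerless comma-separated files with no quoted fields, and joining on the first column; file names are placeholders):

  # external sort on the first comma-separated field, compressing temp files
  sort -t, -k1,1 -S 4G --compress-program=zstd a.csv -o a.sorted.csv
  sort -t, -k1,1 -S 4G --compress-program=zstd b.csv -o b.sorted.csv
  # merge-join the two sorted files on their first field
  join -t, a.sorted.csv b.sorted.csv > joined.csv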


Pretty old now and unsupported, but "crush tools" was fantastic for this kind of thing:

https://github.com/google/crush-tools


This doesn't seem to be worked on either?


Haven't tried XSV. In the same space is Q: https://harelba.github.io/q/


Also csvkit; xsv is way faster but less feature-complete (based on my experience last time I used both).


LINQ can also be used for possibly similar use cases:

https://docs.microsoft.com/en-us/dotnet/csharp/programming-g...


I haven't used xsv, but one of the nice features appears to be that it will work on arbitrarily large datasets. From the README[0]:

> These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).

The example given with LINQ reads the whole files into memory:

  string[] names = System.IO.File.ReadAllLines(@"../../../names.csv");  
  string[] scores = System.IO.File.ReadAllLines(@"../../../scores.csv");
As a fellow LINQ user (probably with LINQPad in this case, for a quick and dirty script), I'd love to know if LINQ can be used to read the files in "slices" like xsv. I'm sure this could be accomplished with enough code, but is there a quick / easy way to do it?

[0]: https://github.com/BurntSushi/xsv#a-whirlwind-tour


>The example given with LINQ reads the whole files into memory:

The examples indeed use `File.ReadAllLines` which returns a `string[]`, but the same thing can be done with `File.ReadLines` which returns an `IEnumerable<string>` - a lazy sequence of lines read on demand as the sequence is enumerated.


XSV is extremely fast. I have a high-frequency trading data API that powers my analytics. The data comes in as JSON, which I write to a RAM disk. Then I transform the JSON to CSV using jq and use xsv to get the data segments and strats I need.

Despite pretty basic code, the performance is very good.


Another tool that can do this and other operations on CSV files is CSVkit. I’ve had very good use for it.

https://csvkit.readthedocs.io/en/1.0.2/


Sigh, this is so much a case of "I have this nice hammer and I will use it for this screw". Maybe this works OK on simple joins with small CSVs, but let's just import the CSV into database temp tables and get the job done.


Surely you are better off simply running PostgreSQL locally, loading it into a couple of tables, and running a simple query?


As always, the correct answer depends on what you have installed, what you’re familiar with, how much data you have, and whether you’re going to be doing it on a regular basis and want to automate it.


Sure, but I think there is a widespread hesitation to use a db unless one's data reaches a higher level of size and complexity, which is misguided.

What I'm saying is that it's good to start with a database from the start, even if the problem seems too trivial for it. That way, if your problem grows into something that actually needs it, you already have it in a suitable database and the code to deal with that.


xsv can be inserted into a shell pipeline at the drop of a hat, regardless of how much data you have. Using a DB is typically much more effort. Just look at the other comments in this thread. Tons of complaints about it being slow and trying to finagle indices to make it better.

Sometimes functionality isn't the only thing that's important. Expression can be just as important.


This is awesome. Would love to see some performance benchmarks on this tool.



Very cool tool!


Full join is missing the Id on 456 and the Key on 999..


For an advanced challenge: do it for a hundred large bz2 files too big to fit on disk in an uncompressed state, such as the monthly-compiled hourly dumps of Wikipedia Pagecounts-EZ (https://dumps.wikimedia.org/other/pagecounts-ez/). This is feasible with CLI utils, but not trivial.


You can also join CSV or Excel files very easily using our new tool: Easy Data Transform: https://www.easydatatransform.com

It is a GUI tool for Windows and Mac. No syntax to remember. Just drag the two files on to EDT and click the 'Join' button then choose the columns to join.

Join is just one of 36 transforms available. There is a 7 day free trial.


You were downvoted because it's off-topic and more like an ad. HN doesn't like ads; self-promotion is allowed, though.

The title mentions joins; I'm not expecting a GUI tool.


Thanks for the explanation. I get that you don't want overly promotional posts. But:

- Easy Data Transform performs SQL-style joins on CSV files. It is one of its main use cases.

- The OP didn't mention anything about it having to be command line.


I think it's more HN related than something else.



