PostgreSQL can do it on live data without importing anything. Set up the "file_fdw" extension:
CREATE EXTENSION file_fdw;
CREATE SERVER fdw FOREIGN DATA WRAPPER file_fdw;
Declare tables:
CREATE FOREIGN TABLE person (id int, sex text)
SERVER fdw
OPTIONS (filename '/tmp/person.csv', format 'csv');
CREATE FOREIGN TABLE weight (id int, weight int)
SERVER fdw
OPTIONS (filename '/tmp/weight.csv', format 'csv');
Now:
SELECT * FROM person JOIN weight ON weight.id = person.id;
That loads all the data into memory, though. That's fine if you're joining all of it, but with the Postgres example you can throw the whole SQL language at the data, and it will be streamed and pipelined efficiently, just like with normal Postgres tables.
The original article doesn't say anything about a "database join". It's about joining two datasets by some common ID.
R in this case fits the bill and even allows for some relational algebra (e.g. an INNER JOIN would be merge(X, Y, all=FALSE), a LEFT OUTER JOIN merge(X, Y, all.x=TRUE), etc.)
No, "import" means "create state from data". It's not creating any persistent state at all.
But it's an important nuance for another reason: you can operate on the original files in real time. You can update the underlying files and your queries will reflect the changes.
The downside is that, since the data is streamed at query time, it's less efficient and doesn't allow indexes to be created.
I can't speak for XSV, but when I had to do similar analysis of logfiles in the range of hundreds of thousands of records, SQLite would consistently seize up on my machine and make it impossible to do analysis. Had I known about XSV I might have tried it, but GNU join saved my bacon that week and would do my analysis in minutes.
That's why you can `EXPLAIN` your query, and maybe even add indices, to speed things up!
Plus, in the last few years SQLite has gotten a lot more powerful: it has added extensive JSON support and improved index usage. I'd avoid working with lots of data as flat ASCII text, because of all the I/O and cycles wasted reading the data, inserting it into some data structure, and then writing it back out as flat ASCII text, but it can be super debuggable.
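For what it's worth, the import-then-index route is only a few lines from Python's sqlite3 module; a rough sketch, reusing the /tmp/person.csv and /tmp/weight.csv files from the Postgres example above and assuming they have a header row:

import csv
import sqlite3

con = sqlite3.connect("join.db")
con.execute("CREATE TABLE person (id INTEGER, sex TEXT)")
con.execute("CREATE TABLE weight (id INTEGER, weight INTEGER)")

# Stream each CSV straight into its table (executemany accepts an iterator).
with open("/tmp/person.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)  # skip the assumed header row
    con.executemany("INSERT INTO person VALUES (?, ?)", rows)
with open("/tmp/weight.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)
    con.executemany("INSERT INTO weight VALUES (?, ?)", rows)

# An index on the join column saves a full scan per lookup.
con.execute("CREATE INDEX weight_id ON weight (id)")
con.commit()

# EXPLAIN QUERY PLAN shows whether the index actually gets used.
for row in con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM person JOIN weight ON weight.id = person.id"
):
    print(row)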
The join utility is actually part of POSIX[1], so every UNIX should have one. Here's one from OpenBSD[2], for example. The GNU version probably has more flags, though.
4-6 hours still seems like a long time for this. Did you have sorted indices on the join columns? How high was the cardinality (distinct count) of the join columns? Traditional RDBMSes are pretty fast at joining when you have the right indices in place, especially when the cardinality of the join columns isn't very high.
And how much of that time was just getting the data into the tables? There are fast and slow ways to do that, too...
IIRC, it took me 4-7 hours to load 600GB of uncompressed data into PostgreSQL after disabling WAL and tuning some config variables. I didn't have time to test all possible speed improvements.
After the import, it took an additional 3-6 hours to create indexes for those tables.
I think the main problem was that I had text indices. I'm not an expert in RDBMS and used a simple join that was taking ages to start producing the actual data. There are definitely a lot of ways to tune such queries and PostgreSQL configs, but I wanted a simple and universal solution.
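For reference, the usual fast path for bulk loading CSV into Postgres is COPY rather than row-by-row INSERTs, with the indexes created only after the load; a minimal sketch via psycopg2 (connection string, table, and path are placeholders):

import psycopg2

con = psycopg2.connect("dbname=mydb")
cur = con.cursor()

# COPY streams the whole file through the server's CSV parser;
# far faster than issuing individual INSERT statements.
with open("/tmp/weight.csv") as f:
    cur.copy_expert("COPY weight FROM STDIN WITH (FORMAT csv)", f)

# Building indexes after the load is usually faster than maintaining
# them while the rows are being inserted.
cur.execute("CREATE INDEX weight_id ON weight (id)")
con.commit()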
Not GP and nothing to do with them, but I have an example of a CSV format which yields absurdly long lines.
We have a CSV for timeseries data, where for some reason someone decided to force each line to represent one data point. However, in some cases one data point may contain around 30 different statistics for 4 million different IPs, which get represented as 120 million columns in the CSV (think a CSV header like 'timestamp,10.0.0.1-Throughput,10.0.0.1-DroppedBytes,[...],10.215.188.251-Throughput,[...]'). With large numbers represented as text in a CSV, this can sometimes reach more than 1GB per line.
When I've had to do something similar, the easiest and fastest way has always been: install Spark -> create an external Spark SQL table from the CSVs -> use Spark SQL to join and aggregate/write. If you haven't used Spark before it sounds complicated, but it's actually really quick and easy.
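Something like the following, if you use the Python bindings (paths and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-join").getOrCreate()

# Register the CSVs as temporary views so they can be queried with SQL.
spark.read.csv("/data/person.csv", header=True, inferSchema=True) \
     .createOrReplaceTempView("person")
spark.read.csv("/data/weight.csv", header=True, inferSchema=True) \
     .createOrReplaceTempView("weight")

joined = spark.sql("""
    SELECT p.*, w.weight
    FROM person p
    JOIN weight w ON w.id = p.id
""")

# Write the result back out as CSV (one file per partition).
joined.write.csv("/data/joined", header=True)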
In general the operation on my laptop can get up to 200 MB/s; it's basically I/O-limited by the SSD of your machine.
Since the OP noted that they were trying to accomplish this task on a c4.xlarge machine, the data is already "in the cloud" and so transfer should be blazingly fast.
Using the zipped CSV at [1] as a size estimate, I'd ballpark 300 million CSV rows at somewhere between 6.7 and 10.6 GB:
Not that I want to be the "well actually" guy... but MS has a much-underrated tool called Log Parser that's been around for years and lets you do this kinda thing and more.
If you're a Windows admin and haven't bumped into this utility then it's well worth a look.
edit: forgot to mention that it can also be scripted from VBScript and PowerShell for added fun.
There is very little overlap between what xsv does and what standard Unix tools like `join` do. Chances are, if you're using xsv for something like this, then you probably can't correctly use `join` to do it because `join` does not understand the CSV format.
If your CSV data happen to fall into the subset of the CSV format that does not include escaped field separators (or record separators), then a tool like `join` could work. Notably, this might include TSV (tab separated files) or files that use the ASCII field/record separators (although I have literally never seen such a file in the wild). But if it's plain old comma separated values, then using a tool like `join` is perilous.
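To make that concrete, here is a tiny made-up record where a quoted field contains the delimiter; a naive line split mangles it, while a real CSV parser does not:

import csv

line = '1,"Doe, Jane",42\n'

# Line-oriented splitting breaks the quoted field apart:
print(line.rstrip("\n").split(","))   # ['1', '"Doe', ' Jane"', '42']
# A CSV parser keeps it intact:
print(next(csv.reader([line])))       # ['1', 'Doe, Jane', '42']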
I didn't write xsv out of ignorance of standard line oriented tools. I wrote it specifically to target problems that cannot be solved by standard line oriented tools.
You might also argue that data should not be formatted in such a way, and philosophically, I don't necessarily disagree with you. But xsv is not a philosophical tool. It is a practical tool to deal with the actual data you have.
They are also the source of much highly practical domain-specific knowledge and advice on how to write Rust code that does, e.g., efficient text file reading, streaming, state machines, etc.
@burntsushi ... Sir .. need to say this.. many thanks for your tools! They make my life so clean and so manageable. Data wrangling on the command line with your tools is such a great experience (minimalistic and efficient).
Is there anything for `json` files that you would recommend? `jq` is awesome but just wondering.
Thanks for the info. That is definitely good to have spelled out.
I don't believe I ever said that xsv shouldn't exist, or has no legitimate purpose, or was written out of ignorance - just that join is useful for similar tasks and worth knowing about, particularly because it's usually available without an install or compile.
You've obviously done a lot more CSV wrangling than I have, and the info here on what exactly xsv does that join cannot is helpful.
DuckDB is a better fit for this job because it is a column store and has a block-oriented vectorized execution engine. This approach is orders of magnitude faster when you’re doing batch operations on millions of rows at a time.
In contrast, SQLite would be orders of magnitude faster than DuckDB when you’re operating on one row at a time.
Main author of DuckDB here, I did not expect to see this mentioned here. DuckDB is a relational DBMS geared towards efficiently handling large analytical-style workloads locally. It is similar to SQLite in the sense that it operates locally on your machine, is easy to run and install and has zero dependencies. However, DuckDB uses modern processing paradigms (vectorized processing, columnar storage) that make it much faster when processing large amounts of data.
It's still at an early stage, but most of the functionality is there (full SQL support, permanent storage, ACID properties). Feel free to give it a try if you are interested. DuckDB has Python and R bindings, and a shell based on the sqlite3 shell. You can find installation instructions here: https://www.duckdb.org/docs/current/tutorials/installation.h...
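For example, a quick sketch of joining two CSV files through the Python bindings, assuming the read_csv_auto table function is available in the build you install:

import duckdb

con = duckdb.connect()  # in-memory database

result = con.execute("""
    SELECT p.id, p.sex, w.weight
    FROM read_csv_auto('/tmp/person.csv') AS p
    JOIN read_csv_auto('/tmp/weight.csv') AS w ON w.id = p.id
""").fetchdf()  # fetch the result as a pandas DataFrame

print(result.head())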
I wonder what niche these tools fill. Maybe when you want to do some ad-hoc analysis and want every step in your shell history?
Otherwise it seems more flexible to just fire up a python interpreter and do it in like 3 lines of pandas (or with sqlite and .import like another commenter mentioned)
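For reference, the pandas version really is about three lines (file names assumed):

import pandas as pd

person = pd.read_csv("/tmp/person.csv")
weight = pd.read_csv("/tmp/weight.csv")

# how="inner" for an inner join; "left", "right", or "outer" also work.
person.merge(weight, on="id", how="inner").to_csv("joined.csv", index=False)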
I’ve never had to use joins, but I’ve used shell scripts in CI environments where I can’t install stuff. For example, I had a Jekyll site generated from a project in some Linux Ruby container. I couldn’t install Python (or this xsv tool) but still wanted to summarize some of the data for display on one of the pages. While I would certainly rather use Python, using shell was easier than convincing the admin of that server to install Python into the Ruby image, or trying to learn Ruby and however they do simple data counts.
This tool is weird because if I’m going to download some tool that’s not in the distro, I’d rather use SQLite or try to figure it out with awk.
The shell is a very powerful REPL for incremental development, so you can often do things more rapidly by looking with `xsv headers`, then joining with `xsv join | xsv select | head | xsv table` than using python or sqlite.
I definitely used to use `sqlite` for every CSV, but I've shifted to using `xsv`. If I want to do any heavy lifting I'm going to pop it into PostgreSQL anyway.
Crucial feature is that it's easy to pop some `xsv` pipeline in the middle of your `for i in ...` or `find . -print0 | xargs -0`.
For me it's ad-hoc analysis on large CSV files. Large meaning well beyond what Excel would be capable of, often larger than fits into memory on my local machine (10s of GiB).
Sometimes I also use xsv to just do a step of the analysis and dive deeper on some subset using pandas.
In my experience, neither SQLite nor Pandas is all that fast for large files, so they are not really good options.
Pandas is especially bad because it uses a column-oriented data structure internally, so reading from or writing to CSV is incredibly slow in Pandas. If you can use Parquet that's not a problem, but unfortunately Parquet is not nearly as ubiquitous as CSV :(
If Pandas is slow, then you can use Spark. For such big files a laptop is not an option anyway. SQLite can be fast if you index your data (but I've worked with files < 10G). Nowadays I just upload the CSV to some cloud database and work with the data there.
> For such big files a laptop is not an option anyway
Too big for Excel is not big data, and my laptop can load this 10G into RAM (not that it necessarily needs all of it), so why not, if the data is there and the laptop is on your lap?
I had to do this type of work frequently at an old job. We'd built short-run software systems for clients.
Often this resulted in one-off data loads or reporting metrics. It was almost always easier just to knock something out with command line tools than it was to write, test, and deploy code. The deployment process itself could take longer than actually processing the file.
The same niche that grep, sed, cut, sort, etc. fill, but specifically for CSV. Using them together (xsv with other Unix tools) can get you very far very quickly on some tasks.
Oracle has a feature called external tables that lets you map regular files to the database and run SQL against them. Works with csv or any format that can be loaded to the database with the normal database loading utilities. This is very useful for getting transformation and loading done in one step!
The feature actually uses the syntax from the old SQLLDR tool; you can specify fields by absolute column position as well as define delimiters (reminiscent of gawk). There are also conditionals (WHEN), rounding rules, and other transformations.
I have used the ODBC drivers for CSV previously. I agree with your assessment: never again. They are unforgiving in what they accept and not a robust solution at all. I spent months cleaning up edge cases the driver should have been able to handle.
The built-in tooling for CSV in PowerShell is excellent (provided you don't care about details like if numbers are quoted or not) - I've been using that for csv-related ETL tasks.
`xsv` can do far more than just join and select, such as searching, filtering, partitioning, ...
For me it's the first choice when I find myself with a CSV file to work with. I've also encountered situations where just getting some raw data from a (slow) database and "querying" it with xsv ended up being the fastest option to get the results I wanted.
It's interesting that there doesn't seem to be a command-line tool out there that lets you do this with LevelDB. LevelDB is about an order of magnitude faster at importing a lot of data than SQLite or Postgres is (although I don't know if it's faster than the Postgres CSV foreign data wrapper for the first query, it sure will be for the second) and it can handle files a lot bigger than your RAM.
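I don't know of a CLI for it either, but from Python the LevelDB route is only a few lines, e.g. with the plyvel bindings (my assumption; any LevelDB binding would do, and the paths/columns are placeholders):

import csv
import plyvel

db = plyvel.DB("/tmp/weight-db", create_if_missing=True)

# Load one input keyed by its join column (column 0 here).
with open("/tmp/weight.csv", newline="") as f:
    for row in csv.reader(f):
        db.put(row[0].encode(), ",".join(row[1:]).encode())

# Stream the other input and look each key up.
with open("/tmp/person.csv", newline="") as f:
    for row in csv.reader(f):
        match = db.get(row[0].encode())
        if match is not None:
            print(",".join(row) + "," + match.decode())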
It's interesting to note that around 1990 The Mark Williams Company wrote a fairly full-fledged database system using, basically, CSV files. It was called "/rdb". You wrote your queries as shell scripts, but using a set of /rdb utilities that handled CSV files with labeled columns, and they had a screen-based data-entry UI based on vi. I wouldn't want to use it instead of a database --- you had to write your query plan in the shell script, rather than a high-level SQL-like query, because the authors didn't really understand SQL --- but it's interesting that the approach is still useful in 2019.
There are many ways to do this. A valid reason for wanting to do this is that doing database imports takes time and sometimes you just want to get to the joined data set without having to wait for that.
A few years ago, I was trying to turn OpenStreetMap dumps into JSON documents. OSM ships as a huge bzipped XML dump of what is essentially 3 tables that need to be joined to do anything productive. One of those tables contains a few billion nodes. Importing that into PostgreSQL takes many hours and takes up disk space too. Bzipped, this stuff was only around 35GB, but it unpacks to >1TB. So, I wrote a few simple tools that processed the XML using regular expressions into gzipped files, sorted those on id, and then joined the files on id by simply scanning through multiple of them. Probably far from optimal, but it got me results quickly enough, without taking ages or consuming lots of disk.
Any info on whether it does this in a streaming manner? For small files there are a plethora of solutions, but it gets more tricky / interesting when you can't assume both files fit in memory and (ideally) you want to do it in a single pass because I/O itself is a bottleneck.
It's almost done in a streaming fashion. `xsv join` uses memory proportional to the number of records in one of the inputs and the size of the join columns in one of the inputs. Namely, it creates an in-memory map from the join column's values to the corresponding record index. Combined with another in-memory index that maps record index to the corresponding record's byte offset, it can jump to the corresponding record in constant time while streaming over the other input.
There are better ways to do this that don't involve using this much memory, but this struck a good balance between memory usage and implementation complexity. I may improve on this some day.
So basically, as long as you have enough memory to store the entire join columns, along with `8 * len(one_input)` bytes for the record index, you should be good.
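Roughly, the approach described above, sketched in Python (not the actual xsv implementation, just the shape of it, and assuming one record per line for simplicity):

import csv

def build_index(path, key_col):
    key_to_records = {}  # join value -> list of record indices
    offsets = []         # record index -> byte offset of that record
    with open(path, newline="") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            row = next(csv.reader([line]))
            key_to_records.setdefault(row[key_col], []).append(len(offsets))
            offsets.append(offset)
    return key_to_records, offsets

key_to_records, offsets = build_index("/tmp/weight.csv", key_col=0)

with open("/tmp/weight.csv", newline="") as indexed, \
     open("/tmp/person.csv", newline="") as streamed:
    for row in csv.reader(streamed):
        for idx in key_to_records.get(row[0], []):
            indexed.seek(offsets[idx])                       # constant-time jump
            matched = next(csv.reader([indexed.readline()]))
            print(",".join(row + matched[1:]))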
If the files are sorted, merge in Python (`heapq.merge` or `sorted` on chunks plus `itertools.groupby`). If they’re not sorted, a single pass is mostly out of the question — you need out-of-core sorting. GNU sort with compression of temporary/intermediate files is hard to beat there for performance, but if your join keys are more complicated than can be specified in GNU sort, you need a more involved pipeline. If raw IO is actually the bottleneck, that’s usually a sign you need better compression — look into Zstandard and lz4, especially for intermediate files. If IO is still the bottleneck, consider possible sharding keys (Python `hash` plus modulo works well, or CRC32 etc.) that would let you divide the problem up among a large number of workers. Everything is easier if you get the data into S3 storage early.
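For the sorted case, that merge looks roughly like this (a sketch: assumes the join key is the first column and both inputs are already sorted on it as text):

import csv
import heapq
import itertools

def tagged(path, tag):
    # Yield (key, tag, row) so the two inputs can be told apart after merging.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row[0], tag, row

# heapq.merge keeps only one pending record per input in memory.
merged = heapq.merge(tagged("/tmp/person.csv", 0),
                     tagged("/tmp/weight.csv", 1),
                     key=lambda item: item[0])

for key, group in itertools.groupby(merged, key=lambda item: item[0]):
    rows = list(group)
    lefts = [row for _, tag, row in rows if tag == 0]
    rights = [row for _, tag, row in rows if tag == 1]
    for left in lefts:
        for right in rights:
            print(",".join(left + right[1:]))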
I haven't used xsv, but one of the nice features appears to be that it will work on arbitrarily large datasets. From the README[0]:
> These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
The example given with LINQ reads the whole files into memory:
As a fellow LINQ user (probably with LINQPad in this case, for a quick and dirty script), I'd love to know whether LINQ can be used to read the files in "slices" like xsv. I'm sure this could be accomplished with enough code, but is there a quick/easy way to do it?
>The example given with LINQ reads the whole files into memory:
The examples indeed use `File.ReadAllLines` which returns a `string[]`, but the same thing can be done with `File.ReadLines` which returns an `IEnumerable<string>` - a lazy sequence of lines read on demand as the sequence is enumerated.
XSV is extremely fast. I have a high-frequency trading data API that powers my analytics. The data comes in as JSON, which I write to a RAMDISK. Then I transform the JSON to CSV using jq and use xsv to get the data segments and strats I need.
Despite pretty basic code, the performance is very good.
Sigh, this is so much a case of "I have this nice hammer and I will use it for this screw". Maybe this works OK on simple joins with small CSVs, but let's just import the CSVs into database temp tables and get the job done.
As always, the correct answer depends on what you have installed, what you’re familiar with, how much data you have, and whether you’re going to be doing it on a regular basis and want to automate it.
Sure, but I think there is a widespread hesitation to use a db unless one's data reaches a higher level of size and complexity, which is misguided.
What I'm saying is that it's good to start with a database from the start, even if the problem seems too trivial for it. That way, if your problem grows into something that actually needs it, you already have it in a suitable database and the code to deal with that.
xsv can be inserted into a shell pipeline at the drop of a hat, regardless of how much data you have. Using a DB is typically much more effort. Just look at the other comments in this thread. Tons of complaints about it being slow and trying to finagle indices to make it better.
Sometimes functionality isn't the only thing that's important. Expression can be just as important.
For an advanced challenge: do it for a hundred large bz2 files too big to fit on disk in an uncompressed state, such as the monthly-compiled hourly dumps of Wikipedia Pagecounts-EZ (https://dumps.wikimedia.org/other/pagecounts-ez/). This is feasible with CLI utils, but not trivial.
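For what it's worth, the decompression itself can be streamed so the uncompressed data never has to hit the disk; a rough Python sketch (the file glob and field handling are placeholders, not the actual pagecounts format):

import bz2
import glob
from collections import Counter

counts = Counter()
for path in sorted(glob.glob("pagecounts-*.bz2")):
    # bz2.open decompresses on the fly, one buffered chunk at a time.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split(" ")
            counts[fields[0]] += 1  # placeholder aggregation

print(counts.most_common(10))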
It is a GUI tool for Windows and Mac. No syntax to remember. Just drag the two files onto EDT and click the 'Join' button, then choose the columns to join.
Join is just one of 36 transforms available. There is a 7 day free trial.
Thanks for the explanation. I get that you don't want overly promotional posts. But:
-Easy Data Transform performs SQL-style joins on CSV files; that is one of its main use cases.
-The OP didn't mention anything about it having to be command line.