DuckDB is faster at counting the lines of a CSV file than wc (twitter.com/duckdb)
7 points by tosh 11 months ago | 6 comments


“Modern tool is faster than old tool” isn’t exactly impressive. But either way, I’m curious about the accuracy of this measurement.

The `wc` command lists 99% CPU usage while the `duckdb` command lists 930%. Presumably the machine has a bunch of CPU cores being used in parallel in the latter case, while the former is bound to be using just one. Doesn’t that mean that on a less capable machine, `wc` might win? So the comparison seems quite selective: yes, DuckDB can be faster, in specific situations. And while it’s a significant increase in relative terms, it’s not much in absolute terms; if you were typing the command by hand, entering the DuckDB version would take longer than the time it saves. Plus it’s maxing out the machine while processing, which could further skew the results if anything else is going on.
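For reference, the two invocations being compared are roughly of this shape (filename made up, not the one from the tweet):

    # single-threaded: counts newline bytes on one core
    wc -l big.csv

    # multi-threaded: DuckDB's parallel CSV reader counts rows across the available cores
    echo "SELECT count(*) FROM read_csv_auto('big.csv');" | duckdb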

So in the end the claim becomes “DuckDB might be marginally faster than wc in certain situations for a specific use case.” Which I guess is fine for a tweet, and kudos to the DuckDB team for caring about performance, but it doesn’t seem worth much discussion beyond that.


I struggle to think of what parallel processing would mean when parsing(!) a newline-delimited file. If the file format specified offsets (e.g., PDF), then I'm on board with delegating the parsing of each declared segment, like a very small map-reduce program.

But CSVs are infamous for yolo-ing quoting and EOL treatment, so just ripping through and spawning threads at each discovered EOL sequence seems like a recipe for misparsing. I guess it's possible to guard each thread worker with the expected number of columns: if its parse attempt doesn't find that many, it can bail out with the expectation that the worker assigned the actual EOL sequence will rescue the error. But what a waste of I/O (and CPU, of course).
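To make the quoting concern concrete: a single CSV record may legally contain newlines inside a quoted field, so counting newline bytes and counting records are different questions. A tiny made-up example (hypothetical filename):

    # one header row plus one record whose quoted field spans two physical lines
    printf 'id,note\n1,"first line\nsecond line"\n' > quoted.csv

    # wc reports 3, because it counts newline bytes
    wc -l quoted.csv

    # a quoting-aware CSV reader treats the embedded newline as data and reports 1 row
    echo "SELECT count(*) FROM read_csv_auto('quoted.csv', header=true);" | duckdb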


This is misleading. First, as other comments have noted, it is comparing multi-threaded/parallelized vs. single-threaded, and its total CPU time is much longer than wc's. Second, it suggests there is something special going on, when there is not. I'm pretty confident that just breaking the file into parts and running wc -l on each, or even running a CSV parser far more versatile than DuckDB's, would perform significantly faster than this showing. Bets, anyone?


I am the author of the original post, and I also wrote a follow-up blog post on it yesterday: https://szarnyasg.org/posts/duckdb-vs-coreutils/

Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.
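Roughly, the shape of that approach (the exact commands and block sizes are in the blog post; filename made up):

    # split the file into byte ranges, count newlines in each range in parallel,
    # then sum the per-range counts
    parallel --pipepart -a big.csv --block 100M wc -l |
        awk '{total += $1} END {print total}'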

That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones which can usually don't run on a laptop.


Your follow-up post is helpful and appreciated!

Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail of how differently the two run is omitted. It seems very unsurprising once it is rephrased to include that detail: "DuckDB, executed multi-threaded and parallelized, is 2.5x faster than wc, executed single-threaded, even though DuckDB used 9.3x more CPU in doing so."

In fact, to me, the only thing that seems surprising about that is how poorly DuckDB does compared to wc: 9.3x more CPU for only a 2.5x improvement, i.e., roughly 27% parallel efficiency.

But an interesting analysis regardless of the takeaways. Thank you!


parallel processing strikes again



