
Have a go with duckdb next time - you can query csv files without loading them first.


You can do that with SQLite too: https://til.simonwillison.net/sqlite/one-line-csv-operations

(DuckDB is a lot more ergonomic for that kind of thing though - it's really fantastic tech)
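The SQLite one-liner linked above uses the CLI's `.import`; the same idea can be sketched with Python's stdlib `sqlite3` and `csv` modules (the CSV content and table name `t` here are made up for illustration):

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice you'd read a real file.
csv_text = "name,age\nalice,30\nbob,25\n"

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Import into an in-memory SQLite database, then query with SQL.
db = sqlite3.connect(":memory:")
cols = ", ".join(f'"{c}"' for c in header)
db.execute(f"CREATE TABLE t ({cols})")
placeholders = ", ".join("?" * len(header))
db.executemany(f"INSERT INTO t VALUES ({placeholders})", data)

result = db.execute(
    "SELECT name FROM t WHERE CAST(age AS INTEGER) > 26"
).fetchall()
print(result)  # [('alice',)]
```

Unlike the DuckDB approach, this does load the rows into the database first, which is exactly the trade-off being discussed.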


Cool! Although I believe duckdb can do it on disk / out of memory, so querying huge files is possible. I also like its syntax; I tend to CREATE VIEW mycsv AS SELECT * FROM 'my.csv' (or similar). Then I think you can select or join even across files, although I haven't gotten that far yet.
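The view-over-CSV pattern described above might look like this in DuckDB's SQL (the file names and column names here are hypothetical, chosen just to illustrate a cross-file join):

```sql
-- Create views over CSV files without importing them first
CREATE VIEW orders    AS SELECT * FROM 'orders.csv';
CREATE VIEW customers AS SELECT * FROM 'customers.csv';

-- Join across the two files as if they were ordinary tables
-- (customer_id / id / name are assumed columns)
SELECT c.name, COUNT(*) AS n_orders
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name;
```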


Unlike sqlite, DuckDB is very complex and much less polished.


The spirit of the question, and the answer being sought, is to avoid loading all of the data into memory first.


I believe duckdb does not load the whole csv file into memory. It will load a few rows to find column headers and guess data types.


You’re just pettifogging the situation. The spirit of the question is to find a solution that is acceptable/performant algorithmically.

Certainly, there are hiring panels that appreciate these sorts of tricks for getting around the problem, usually citing “out of the box” thinking, but the majority would probably just say “do it without that tool” or mark you as a fail.



