I asked this some time ago on their Discord in relation to AWS Lambda and the Python client, and the answer was that you need to handle caching on your own, but it's easy to do with fsspec. I haven't tried it yet, though.


Do you have any details on this?

DuckDB over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express One Zone supports low-latency access, but at a cost.

Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so caching would only entail fetching the handful of smaller Parquet files a query actually touches, not the entire dataset.
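For instance (a sketch with a hypothetical bucket and layout; "year" stands in for whatever partition key the data actually uses), DuckDB can prune Hive partitions on a filter, so only the matching objects ever get read or cached:

    import duckdb

    con = duckdb.connect()
    # Assumes the httpfs extension and S3 credentials are configured.
    # With hive_partitioning enabled, a filter on the partition column
    # prunes the file list, so only the matching objects are read
    # (and would be the only ones a cache has to hold).
    con.sql("""
        SELECT count(*)
        FROM read_parquet('s3://my-bucket/events/*/*.parquet',
                          hive_partitioning = true)
        WHERE year = 2023
    """).show()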


So, the way I understand it, you would create an fsspec filesystem with a filecache (https://filesystem-spec.readthedocs.io/en/latest/features.ht...) and pass that to DuckDB to use (https://duckdb.org/docs/guides/python/filesystems.html). Like I said, I haven't tried this yet, but it seems straightforward. They are also pretty responsive on Discord, so if you face any issues you can try asking there (https://discord.com/invite/tcvwpjfnZx).
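Something like this minimal, untested sketch (assumes the s3fs package is installed; the bucket name and cache path are placeholders):

    import duckdb
    from fsspec import filesystem

    # Wrap S3 in fsspec's whole-file cache: the first read downloads the
    # object to local disk, and later reads are served from the cache.
    fs = filesystem(
        "filecache",
        target_protocol="s3",
        target_options={"anon": False},     # usual s3fs credential options
        cache_storage="/tmp/duckdb_cache",  # /tmp is the writable dir on Lambda
    )

    con = duckdb.connect()
    con.register_filesystem(fs)

    # Queries then go through the caching filesystem via its protocol prefix.
    con.sql(
        "SELECT count(*) FROM read_parquet('filecache://my-bucket/data/*.parquet')"
    ).show()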


I really appreciate it! Thanks.



