I asked this some time ago on their Discord in relation to AWS Lambda and the Python client, and the answer was that you need to handle caching on your own, but it's easy to do with fsspec. I haven't tried it yet, though.


Do you have any details on this?

DuckDB over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express One Zone supports low-latency access, but at a cost.

Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so caching would only entail fetching the handful of smaller Parquet files a query actually touches, not the entire dataset.
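For instance (a sketch with a hypothetical bucket and layout; "year" stands in for whatever partition key the data actually uses), DuckDB can prune Hive partitions on a filter, so only the matching objects ever get read or cached:

    import duckdb

    con = duckdb.connect()
    # Assumes the httpfs extension and S3 credentials are configured.
    # With hive_partitioning enabled, a filter on the partition column
    # prunes the file list, so only the matching objects are read
    # (and would be the only ones a cache has to hold).
    con.sql("""
        SELECT count(*)
        FROM read_parquet('s3://my-bucket/events/*/*.parquet',
                          hive_partitioning = true)
        WHERE year = 2023
    """).show()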


So, the way I understand it, you would create an fsspec filesystem with a filecache (https://filesystem-spec.readthedocs.io/en/latest/features.ht...) and pass that to DuckDB to use (https://duckdb.org/docs/guides/python/filesystems.html). Like I said, I haven't tried this yet, but it seems straightforward. They are also pretty responsive on Discord, so if you face any issues you can try asking there (https://discord.com/invite/tcvwpjfnZx).
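Something like this minimal, untested sketch (assumes the s3fs package is installed; the bucket name and cache path are placeholders):

    import duckdb
    from fsspec import filesystem

    # Wrap S3 in fsspec's whole-file cache: the first read downloads the
    # object to local disk, and later reads are served from the cache.
    fs = filesystem(
        "filecache",
        target_protocol="s3",
        target_options={"anon": False},     # usual s3fs credential options
        cache_storage="/tmp/duckdb_cache",  # /tmp is the writable dir on Lambda
    )

    con = duckdb.connect()
    con.register_filesystem(fs)

    # Queries then go through the caching filesystem via its protocol prefix.
    con.sql(
        "SELECT count(*) FROM read_parquet('filecache://my-bucket/data/*.parquet')"
    ).show()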


I really appreciate it! Thanks.



