
To my understanding, it does not cache (though I haven't reviewed the code). "Expensive" here really just means expensive in time. If you're running DuckDB on an EC2 instance and querying S3 in the same account and region, it's practically free beyond the instance itself, since there's no data-transfer charge within a region. I wouldn't bother doing it from outside AWS unless it's a one-time thing, because S3 egress fees add up fast.

Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions to the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet); DuckDB understands these key=value segments as hive-style partitions and will prune down to just the matching files, something like the sketch below. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.
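
A minimal sketch of what that looks like from Python, assuming the bucket layout above and credentials picked up from the environment or instance role (the bucket name and column names are just the example ones from the path):

    import duckdb

    con = duckdb.connect()
    # httpfs gives DuckDB S3 support; creds come from env vars / instance role
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # hive_partitioning=1 turns the category=... and month=... path segments
    # into queryable columns, so the WHERE clause below only touches files
    # under the matching prefix instead of scanning all 6.6K of them.
    n = con.execute("""
        SELECT count(*)
        FROM read_parquet('s3://my-data/*/*/*.parquet', hive_partitioning=1)
        WHERE category = 'transactions' AND month = '2024-05'
    """).fetchone()[0]
    print(n)

The pruning is where the savings come from: DuckDB only lists and reads objects whose path matches the partition predicates, which is the closest thing to an index you get with plain parquet-on-S3.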


