I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-... | Hacker News


mxmlnkn 9 months ago | on: I prefer human-readable file formats

I can confirm the usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB and decompresses to 1.4 TB, and even the uncompressed index for random access with indexed_gzip is 11 GiB. The compressed random-access index format, which gztool supports, would be 1.4 GB (compression ratio 8). rapidgzip also supports the compressed gztool format and shrinks the file further by doing a sparsity analysis of the required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.

The trick of mixing JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes, although the gztool index format seems to suffice for now.
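
The sparsity trick mentioned above can be illustrated with a toy example. This is only a sketch of the general idea, not rapidgzip's actual implementation: the `sparsify` helper, the random window contents, and the assumption that only the first 1 KiB of the window is ever referenced are all made up for illustration.

```python
import os
import zlib

WINDOW_SIZE = 32 * 1024  # Deflate back-references reach at most 32 KiB back


def sparsify(window: bytes, used_offsets: set[int]) -> bytes:
    """Return a copy of `window` with every byte outside `used_offsets` set to 0.

    Hypothetical helper: a real tool would derive `used_offsets` by checking
    which window bytes the data after the seek point actually references.
    """
    out = bytearray(len(window))  # starts out as all zeros
    for i in used_offsets:
        out[i] = window[i]
    return bytes(out)


# Stand-in for one seek point window: random bytes are a worst case,
# since they are essentially incompressible.
window = os.urandom(WINDOW_SIZE)

# Assumption for illustration: only the first 1 KiB is ever referenced.
used = set(range(1024))
sparse = sparsify(window, used)

full_size = len(zlib.compress(window, 9))
sparse_size = len(zlib.compress(sparse, 9))
print(f"full window:   {full_size} bytes compressed")
print(f"sparse window: {sparse_size} bytes compressed")
```

The long run of zeros compresses to almost nothing, so the compressed size is dominated by the bytes that are actually needed.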

out_of_protocol 9 months ago

1) Replacing gzip compression with zstd will speed things up a lot while also reducing the size on disk.

2) Plain old SQLite seems like a good choice of format, and it is widely supported. Fast indexes are included.

3) Combining (1) and (2) is probably a good idea as well.

4) There's also Parquet.
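
A rough sketch of what combining (1) and (2) could look like: independently compressed chunks stored in an SQLite table keyed by uncompressed offset, so a read only has to decompress the chunks it touches. The table layout, the function names, and the 64 KiB chunk size are assumptions made for illustration. `zlib` stands in for zstd here only because it ships with Python; a zstd binding could be swapped in for the two compress/decompress calls.

```python
import sqlite3
import zlib

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def build_store(conn: sqlite3.Connection, data: bytes, chunk_size: int = CHUNK_SIZE) -> None:
    """Split `data` into chunks, compress each one, and store it by offset."""
    conn.execute(
        "CREATE TABLE chunks ("
        " offset INTEGER PRIMARY KEY,"  # uncompressed start offset, indexed
        " payload BLOB NOT NULL)"
    )
    for offset in range(0, len(data), chunk_size):
        raw = data[offset:offset + chunk_size]
        # zlib is a stand-in for zstd (assumption, see lead-in).
        conn.execute("INSERT INTO chunks VALUES (?, ?)", (offset, zlib.compress(raw)))
    conn.commit()


def read_range(conn: sqlite3.Connection, start: int, size: int) -> bytes:
    """Return `size` uncompressed bytes starting at uncompressed offset `start`."""
    # Find the chunk that contains `start` using the primary-key index.
    first = conn.execute(
        "SELECT MAX(offset) FROM chunks WHERE offset <= ?", (start,)
    ).fetchone()[0]
    if first is None:
        return b""
    rows = conn.execute(
        "SELECT payload FROM chunks WHERE offset >= ? AND offset < ? ORDER BY offset",
        (first, start + size),
    )
    buf = b"".join(zlib.decompress(payload) for (payload,) in rows)
    return buf[start - first:start - first + size]


# Usage: some JSON-lines-like sample data, stored in an in-memory database.
data = b"".join(b'{"id": %d, "label": "item"}\n' % i for i in range(10000))
conn = sqlite3.connect(":memory:")
build_store(conn, data)
print(read_range(conn, 100000, 50))
```

Smaller chunks make random reads cheaper but hurt the compression ratio, so the chunk size is the main knob to tune in a layout like this.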
