I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.
The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.
The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.