Bit Packing like a Madman (2016) (dconf.org)
10 points by teleforce on Sept 16, 2021 | 3 comments


I didn't watch the video, just took a brief look over the slides.

Bit packing is one of many memory/CPU tradeoffs: you use a little less RAM, but a little more CPU power. (Padding is the opposite tradeoff: a bit more RAM for a bit less CPU work.)
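As a concrete (made-up) example of that tradeoff in Rust, take three fields that each need at most 20 bits:

    use std::mem::size_of;

    // Padded: each field in its own u32 -> 12 bytes per record,
    // and every access is a plain aligned load or store.
    #[allow(dead_code)]
    struct Padded {
        x: u32, // 20 bits used
        y: u32, // 20 bits used
        z: u32, // 20 bits used
    }

    // Packed: all 60 bits in one u64 -> 8 bytes per record, a third
    // less RAM, paid for with shift/mask work on every access.
    struct Packed(u64);

    fn main() {
        println!("padded: {} bytes", size_of::<Padded>()); // 12
        println!("packed: {} bytes", size_of::<Packed>()); // 8
    }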

The thing about modern CPUs is that they're incredibly parallel. I think AMD Zen 3 has 10 pipelines per core (and 6 of those pipelines can execute 256-bit SIMD instructions each cycle). Intel is around the same; Apple's M1 focuses on integer pipelines more than SIMD pipelines, but is also incredibly parallel. As such, we programmers have been given increasingly huge amounts of CPU power: even our cell phones can execute 4 to 8 instructions per clock tick per core.

In contrast, DDR3, DDR4, and DDR5 RAM have improved, but nowhere close to the gains CPUs have made. So a lot of our CPU pipelines are sitting around empty, doing nothing... just waiting for RAM to respond.

As such, more and more applications will benefit from trading more CPU usage for less RAM usage. The CPU-vs-RAM tradeoff is becoming incredibly lopsided: over in GPU-land, visual effects programmers regularly keep textures compressed in memory and decompress them on the fly. There are people today who claim Zstandard in-memory compression is faster than reading/writing the uncompressed data to RAM directly.
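For a rough idea of what that looks like, here's a sketch with the zstd crate in Rust (the payload and compression level are made up; whether this actually beats raw RAM traffic depends entirely on the data and the machine):

    // cargo add zstd
    fn main() -> std::io::Result<()> {
        // Highly repetitive data compresses well; this is the best case.
        let original: Vec<u8> = b"some JSON payload ".repeat(100_000);

        let compressed = zstd::encode_all(&original[..], 3)?; // level 3
        let restored = zstd::decode_all(&compressed[..])?;

        assert_eq!(original, restored);
        println!("{} bytes -> {} bytes", original.len(), compressed.len());
        Ok(())
    }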

Spending a few bitmask/bitshift instructions to read and write fields inside a 64-bit integer takes far less CPU power than running Zstandard over all your in-memory JSON/XML strings, but strategically it's the same move.
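Something like this, say (same hypothetical 20-bit fields as above):

    // Three 20-bit fields packed into one u64; `slot` selects which one.
    const MASK: u64 = (1 << 20) - 1;

    fn get(word: u64, slot: u32) -> u64 {
        (word >> (slot * 20)) & MASK
    }

    fn set(word: u64, slot: u32, value: u64) -> u64 {
        let shift = slot * 20;
        (word & !(MASK << shift)) | ((value & MASK) << shift)
    }

    fn main() {
        let mut w = 0u64;
        w = set(w, 0, 123_456);
        w = set(w, 2, 999_999);
        assert_eq!(get(w, 0), 123_456);
        assert_eq!(get(w, 2), 999_999);
    }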


This makes me wonder how this could be applied in an interpreter. I have one with array/relational semantics and have wondered how to pack N values per row/column in a way that is efficient to encode/decode, but I always fear that all the extra processing will be costly, so I end up just letting Rust do its magic.


I mean, the #1 job of any optimizing programmer is rather simple when you think about it.

1. Program attempt A.

2. Program attempt B using the same interface.

3. Develop a suite of tests.

4. Profile A and B using those tests.

5. Choose the methodology that uses fewer resources (in particular: time, memory, cache).

6. Use the experience from 1-5 to make attempt C. Then repeat.

#5 is worded like that on purpose. Different programs will be constrained by different resources, and running out of one resource usually shows up as time (run out of RAM and your program continues, but only through swap; run out of L3 cache and it continues, but through RAM). A methodology that uses less RAM will stay in L1 cache longer, but if everyone's data is so huge you're in DDR4 anyway, maybe a different methodology would work instead.
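To make the loop concrete, here's a toy version in Rust comparing the padded and packed layouts from above (std::time::Instant is crude; a real harness like criterion controls for noise, and you'd build with --release):

    use std::time::Instant;

    const MASK: u64 = (1 << 20) - 1;

    // Implementation A: three 20-bit values stored padded, one per u32.
    fn sum_padded(data: &[u32]) -> u64 {
        data.iter().map(|&x| x as u64).sum()
    }

    // Implementation B: three 20-bit values packed per u64.
    fn sum_packed(data: &[u64]) -> u64 {
        data.iter()
            .map(|&w| (w & MASK) + ((w >> 20) & MASK) + ((w >> 40) & MASK))
            .sum()
    }

    fn main() {
        // One shared workload, big enough to spill out of cache.
        let n: u32 = 10_000_000;
        let padded: Vec<u32> = (0..n * 3).map(|i| i & 0xFFFFF).collect();
        let packed: Vec<u64> = padded
            .chunks(3)
            .map(|c| c[0] as u64 | (c[1] as u64) << 20 | (c[2] as u64) << 40)
            .collect();

        let t = Instant::now();
        let a = sum_padded(&padded);
        println!("padded: {a} in {:?}", t.elapsed());

        let t = Instant::now();
        let b = sum_packed(&packed);
        println!("packed: {b} in {:?}", t.elapsed());

        assert_eq!(a, b); // same interface, same answer
    }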



