Lookup tables with precalculated things for the win! In fact I don’t think we wo...

crmd · 2025-05-22T00:23:09 1747873389

Reminds me of when I started working on storage systems as a young man and once suggested pre-computing every 4KB block once and just using pointers to the correct block as data is written, until someone pointed out that the number of unique 4KB blocks (2^32768) far exceeds the number of atoms in the universe.
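For a sense of scale, here is a quick back-of-the-envelope check. The figure of roughly 10^80 atoms in the observable universe is a commonly quoted estimate that I am assuming here; it is not a precise number.

```python
import math

BLOCK_BYTES = 4096                  # one 4KB block
block_bits = BLOCK_BYTES * 8        # 32768 bits, so 2**32768 possible blocks

# Count the decimal digits of 2**block_bits via logarithms,
# so the full number never has to be printed.
digits_in_block_count = math.floor(block_bits * math.log10(2)) + 1

# Assumption: about 10**80 atoms in the observable universe (81 digits).
ATOM_ESTIMATE_DIGITS = 81

print(block_bits)                                      # 32768
print(digits_in_block_count)                           # 9865
print(digits_in_block_count - ATOM_ESTIMATE_DIGITS)    # 9784 more digits
```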

manwe150 · 2025-05-22T02:21:41 1747880501

It seems like you weren’t really that far off from implementing it, you just need a 4 KB pointer to point to the right block. And in fact, that is what all storage systems do!

jodrellblank · 2025-05-22T02:22:43 1747880563

Reminds me of when I imagined brute-forcing every possible small picture as simply 256 shades of gray for each pixel x (640 x 480 = 307200 pixels) = 78 million possible pictures.

Actually I don't have any intuition for why that's wrong, except that if we catenate the rows into one long row then the picture can be considered as a number 307200 digits long in base 256, and then I see that it could represent 256^307200 possible different values. Which is a lot: https://www.wolframalpha.com/input?i=256%5E307200

p1necone · 2025-05-22T03:01:10 1747882870

78 million is how many pixels would be in 256 different pictures with 307200 pixels each. You're only counting each pixel once for each possible value, but you actually need to count each possible value on each pixel once per possible combination of all of the other pixels.

The number of possible pictures is indeed 256^307200, which is an unfathomably larger number than 78 million. (256 possible values for the first pixel * 256 possible values for the second pixel * 256 possi...).
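A rough sketch of the two counts side by side. The variable names are mine, and the big number is measured by its digit count because it is far too long to print:

```python
import math

WIDTH, HEIGHT, SHADES = 640, 480, 256
pixels = WIDTH * HEIGHT

# One count per (pixel, shade) pair: this is where 78 million comes from.
pixel_shade_pairs = pixels * SHADES

# Every pixel picks a shade independently, so the real count is SHADES ** pixels.
# Measure it by its number of decimal digits instead of printing it.
digits_in_picture_count = math.floor(pixels * math.log10(SHADES)) + 1

print(pixel_shade_pairs)          # 78643200
print(digits_in_picture_count)    # 739812
```

So 78 million is an 8-digit number, while the number of pictures has roughly 740 thousand digits.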

danwills · 2025-05-22T09:40:29 1747906829

Yeah I had a similar thought back in the 90s and made a program to iterate through all possible images at a fairly low res, I left it running while I was at school and got home after many hours to find it had hardly got past the first row of pixels! This was a huge eye-opener about how big a possibility-space digital images really exist in!

mystified5016 · 2025-05-22T16:47:01 1747932421

I had the same idea when I first learned about programming as a teenager. I wonder how many young programmers have had this exact same train of thought?

deadfoxygrandpa · 2025-05-22T06:20:29 1747894829

i think at some point you should have realized that there are obviously more than 78 million possible greyscale 640x480 pictures. theres a lot of intuitive examples but just think of this:

https://images.lsnglobal.com/ZFSJiK61WTql9okXV1N5XyGtCEc=/fi...

if there were only 78 million possible pictures, how could that portrait be so recognizably one specific person? wouldnt that mean that your entire picture space wouldnt even be able to fit a single portrait of everyone in Germany?

jodrellblank · 2025-05-22T12:57:58 1747918678

"At some point" I do realise it. What I don't have is an intuitive feel for why a number can be three digits 000 to 999 and each place has ten choices, but it's not 10 x 3 possibles. I tried to ask ChatGPT to give me an intuition for it, but all it does is go into an explanation of combinations. I know it's 10 x 10 x 10 meaning 10^3 I don't need that explanation again, what I'm looking for is an intuition for why it isn't 10x3.

> "if there were only 78 million possible pictures, how could that portrait be so recongizably one specific person? wouldnt that mean that your entire picture space wouldnt even be able to fit a single portrait of everyone in Germany?"

It's not intuitive that "a 640x480 computer picture must be able to fit a single portrait of everyone in Germany"; A human couldn't check it, a human couldn't remember 78 million distinct pictures, look through them, and see that they all look sufficiently distinct and at no point is it representing 50k people with one picture; human attention and memory isn't enough for that.

ThomasBHickey · 2025-05-22T14:39:42 1747924782

Try starting with a 2x2, then 3x3, etc. image and manually list all the possibilities.
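If listing them by hand gets tedious, a small sketch with `itertools.product` does the same exercise. Using only 2 shades is my choice here, to keep the lists short:

```python
from itertools import product

def all_images(width, height, shades):
    """Return every possible image as a flat tuple of pixel values."""
    return list(product(range(shades), repeat=width * height))

two_by_two = all_images(2, 2, shades=2)
print(len(two_by_two))                    # 16, i.e. 2 ** 4
print(2 * 4)                              # 8: (pixel, shade) pairs, not images

three_by_three = all_images(3, 3, shades=2)
print(len(three_by_three))                # 512, i.e. 2 ** 9
```

The product `2 * 4` only counts the single-pixel choices available; each image is one full assignment of a shade to every pixel.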

jodrellblank · 2025-05-23T00:51:32 1747961492

That's focusing on the wrong thing; as I said, "I know it's 10 x 10 x 10 meaning 10^3 I don't need that explanation [for the correct combinations], what I'm looking for is an intuition for why it isn't 10x3".

cwmoore · 2025-05-25T23:17:59 1748215079

ChatGPT might be able to explain combinatorics if you use the keyterm.

I’m fond of derangements and their relationship with permutations, which contain a factor of e.

plastic3169 · 2025-05-22T09:34:30 1747906470

I had a friend who had the same idea to do it for pixel fonts with only two colors and 16x16 canvas. It was still 2^256. Watching that thing run and trying to estimate when it would finish made me understand encryption.

benchloftbrunch · 2025-05-22T13:46:27 1747921587

The other problem is that (if we take literally the absurd proposal of computing "every possible block" up front) you're not actually saving any space by doing this, since your "pointers" would be the same size as the blocks they point to.
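A quick check of that claim, as a sketch: the number of bits needed to index N entries is the bit length of N - 1.

```python
BLOCK_BITS = 4096 * 8             # bits in one 4KB block
num_blocks = 2 ** BLOCK_BITS      # every possible block, stored exactly once

# Bits needed to give each of num_blocks entries a distinct index.
pointer_bits = (num_blocks - 1).bit_length()

print(pointer_bits)               # 32768
print(pointer_bits // 8)          # 4096 bytes, the same size as the block
```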

lesuorac · 2025-05-22T16:48:19 1747932499

If you don't do _actually_ every single block then you have Huffman Coding [1].

I imagine if you have a good idea of the data incoming you could probably do a similar encoding scheme where you use 7 bits to point to a ~512 bit blob and the 8th bit means the next 512 couldn't be compressed.

[1]: https://en.wikipedia.org/wiki/Huffman_coding
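A minimal sketch of the second idea (a fixed 7-bit index plus a flag bit). To be clear, this is plain dictionary coding with a fixed-size table, not Huffman coding, and the table contents below are made up for illustration:

```python
BLOB_BYTES = 64          # 512 bits
LITERAL_FLAG = 0x80      # 8th bit set: the next blob is stored uncompressed

def encode(data: bytes, table: list[bytes]) -> bytes:
    """Replace known 64-byte blobs with a 1-byte index; pass others through."""
    assert len(table) <= 128 and len(data) % BLOB_BYTES == 0
    index_of = {blob: i for i, blob in enumerate(table)}
    out = bytearray()
    for pos in range(0, len(data), BLOB_BYTES):
        blob = data[pos:pos + BLOB_BYTES]
        if blob in index_of:
            out.append(index_of[blob])     # 7-bit index, top bit clear
        else:
            out.append(LITERAL_FLAG)       # marker byte
            out += blob                    # raw blob follows
    return bytes(out)

def decode(encoded: bytes, table: list[bytes]) -> bytes:
    """Inverse of encode: expand indexes, copy literal blobs."""
    out = bytearray()
    pos = 0
    while pos < len(encoded):
        tag = encoded[pos]
        pos += 1
        if tag & LITERAL_FLAG:
            out += encoded[pos:pos + BLOB_BYTES]
            pos += BLOB_BYTES
        else:
            out += table[tag]
    return bytes(out)

# Hypothetical table of "common" blobs: all zeros and all 0xFF.
table = [bytes(BLOB_BYTES), b"\xff" * BLOB_BYTES]
data = bytes(BLOB_BYTES) * 3 + b"A" * BLOB_BYTES + b"\xff" * BLOB_BYTES
packed = encode(data, table)
print(len(data), len(packed))      # 320 69
```

The cost of the flag is one extra byte on every blob that is not in the table, so this only pays off when table hits are common.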

makmanalp · 2025-05-22T01:30:53 1747877453

In some contexts, dictionary encoding (which is what you're suggesting, approximately) can actually work great. For example common values or null values (which is a common type of common value). It's just less efficient to try to do it with /every/ block. You have to make it "worth it", which is a factor of the frequency of occurrence of the value. Shorter values give you a worse compression ratio on one hand, but on the other hand it's often likelier that you'll find it in the data so it makes up for it, to a point.

There are other similar lightweight encoding schemes like RLE and delta and frame of reference encoding which all are good for different data distributions.
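Toy versions of the three, just to show the shape of each encoding. Real column stores bit-pack the results, which this sketch does not attempt:

```python
from itertools import groupby

def rle(values):
    """Run-length encoding: (value, run length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta(values):
    """Delta encoding: first value, then differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def frame_of_reference(values):
    """Frame of reference: a base (the minimum) plus small offsets."""
    base = min(values)
    return base, [v - base for v in values]

print(rle([0, 0, 0, 0, 7, 7, 0]))               # [(0, 4), (7, 2), (0, 1)]
print(delta([1000, 1001, 1003, 1006]))          # [1000, 1, 2, 3]
print(frame_of_reference([1000, 1006, 1003]))   # (1000, [0, 6, 3])
```

RLE suits long runs of repeated values, delta suits slowly changing sequences, and frame of reference suits values clustered in a narrow range.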

ww520 · 2025-05-22T01:13:20 1747876400

The idea is not too far off. You could compute a hash on an existing data block. Store the hash and data block mapping. Now you can use the hash anywhere that data block resides, i.e. any duplicate data blocks can use the same hash. That's how storage deduplication works in a nutshell.
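A toy sketch of the idea, assuming SHA-256 as the hash and an in-memory dict as the store. The class and method names are made up for illustration, and real systems add reference counting and on-disk layout:

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store: identical blocks are kept once."""

    def __init__(self):
        self.blocks = {}      # digest -> block bytes
        self.files = {}       # file name -> list of digests

    def write(self, name: str, data: bytes, block_size: int = 4096) -> None:
        digests = []
        for pos in range(0, len(data), block_size):
            block = data[pos:pos + block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)   # store only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])

store = DedupStore()
store.write("a.img", b"\x00" * 8192 + b"hello")
store.write("b.img", b"\x00" * 4096)
print(len(store.blocks))     # 2 unique blocks back 4 logical blocks
```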

valenterry · 2025-05-22T01:18:45 1747876725

Except that there are collisions...

datameta · 2025-05-22T01:32:20 1747877540

This might be completely naive but can a reversible time component be incorporated into distinguishing two hash calculations? Meaning when unpacked/extrapolated it is a unique signifier but when decomposed it folds back into the standard calculation - is this feasible?

shakna · 2025-05-22T05:54:16 1747893256

Some hashes do have verification bits, which are used not just to verify that the hash is intact, but also to tell one "identical" hash from another. However, they do tend to be slower hashes.

grumbelbart2 · 2025-05-22T06:15:24 1747894524

Do you have an example? That just sounds like a hash that is a few bits longer.

shakna · 2025-05-22T06:27:14 1747895234

Mostly use of GCM (Galois/Counter Mode). Usually you tag the key, but you can also tag the value to check verification of collisions instead.

But as I said, slow.

ruined · 2025-05-22T05:47:36 1747892856

hashes by definition are not reversible. you could store a timestamp together with a hash, and/or you could include a timestamp in the digested content, but the timestamp can’t be part of the hash.

RetroTechie · 2025-05-22T13:44:16 1747921456

> hashes by definition are not reversible.

Sure they are. You could generate every possible input, compute hash & compare with a given one.

Ok it might take infinite amount of compute (time/energy). But that's just a technicality, right?

dagw · 2025-05-22T13:47:43 1747921663

> Sure they are. You could generate every possible input

Depends entirely on what you mean by reversible. For every hash value, there are an infinite number of inputs that give that value. So while it is certainly possible to find some input that hashes to a given value, you cannot know which input I originally hashed to get that value.
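A small demonstration, assuming a deliberately weak 16-bit hash (the first two bytes of SHA-256) so the brute force finishes quickly. The search finds an input with the right hash, but not the one that was actually hashed:

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    """First 2 bytes of SHA-256: a deliberately weak 16-bit hash for the demo."""
    return hashlib.sha256(data).digest()[:2]

original = b"the block I actually hashed"
target = tiny_hash(original)

# "Reverse" the hash by trying inputs until one matches.
for i in count():
    candidate = str(i).encode()
    if tiny_hash(candidate) == target:
        break

print(tiny_hash(candidate) == target)    # True: a valid preimage
print(candidate == original)             # False: not the original input
```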

datameta · 2025-05-24T19:45:59 1748115959

Oh, of course, the timestamp could instead be metadata!

ww520 · 2025-05-22T16:16:24 1747930584

Can use cryptographic hashing.

anonymars · 2025-05-22T17:02:11 1747933331

How does that get around the pigeonhole principle?

I think you'd have to compare the data value before purging, and you can only do the deduplication (purge) if the block is actually the same, otherwise you have to keep the block (you can't replace it with the hash because the hash link in the pool points to different data)
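Roughly like this, as a sketch. I am using a 1-byte hash on purpose so that collisions actually happen in the demo, and the class is hypothetical:

```python
import hashlib

def weak_hash(block: bytes) -> bytes:
    """1-byte hash, chosen so collisions are easy to trigger in a demo."""
    return hashlib.sha256(block).digest()[:1]

class VerifyingStore:
    def __init__(self):
        self.buckets = {}     # hash -> list of distinct blocks with that hash

    def put(self, block: bytes):
        """Return (hash, slot). Reuse a slot only if the bytes really match."""
        digest = weak_hash(block)
        bucket = self.buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == block:          # true duplicate: deduplicate
                return digest, slot
        bucket.append(block)               # new data or a collision: keep it
        return digest, len(bucket) - 1

    def get(self, ref):
        digest, slot = ref
        return self.buckets[digest][slot]

store = VerifyingStore()
refs = [store.put(str(i).encode()) for i in range(1000)]   # 1000 distinct blocks
print(len(store.buckets) <= 256)                           # True: many collisions
print(all(store.get(r) == str(i).encode() for i, r in enumerate(refs)))  # True
print(store.put(b"7") == refs[7])                          # True: duplicate reused
```

The price is that a reference is now a hash plus a slot number, and every write that hits an existing hash costs a byte comparison.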

ww520 · 2025-05-23T01:00:15 1747962015

The hash collision chance is extremely low.

valenterry · 2025-05-23T01:21:47 1747963307

For small amounts of data yeah. With growing data, the chance of a collision grows more than proportionally. So in the context of working on storage systems (like s3 or so) that won't work unless customers actually accept the risk of a collision as okay. So for example, when storing media data (movies, photos), I could imagine that, but not for data in general.

ww520 · 2025-05-23T02:07:27 1747966047

Cryptographic hashing collisions are very very small, like end of universe in numerous times small. They're smaller than AWS being burnt down and all backups were lost leading to data loss.
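A birthday-style estimate, as a sketch. It assumes the hash behaves like a uniformly random function, and the block count below is a hypothetical figure I picked, not a real number for any provider:

```python
def expected_collisions(num_blocks: int, hash_bits: int) -> float:
    """Expected number of colliding pairs among num_blocks random hashes."""
    pairs = num_blocks * (num_blocks - 1) // 2
    return pairs / 2 ** hash_bits

NUM_BLOCKS = 10 ** 18      # hypothetical: about 4 zettabytes of 4KB blocks

for bits in (128, 256):
    print(bits, expected_collisions(NUM_BLOCKS, bits))
```

The estimate grows with the square of the block count, so doubling the data roughly quadruples the expected collisions, while each extra hash bit halves them.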

valenterry · 2025-05-23T04:59:03 1747976343

You have a point.

With MD5 (128-bit), if AWS S3 applied this technique, it would only get a handful of collisions. Using 256 bits would drive that down to a level where any collision is very unlikely.

This would be worth it if a 4kb block is, on average, duplicated with a chance of at least 6.25%. (not considering overhead of data-structures etc.)

Nevermark · 2025-05-22T17:15:12 1747934112

The other problem is to address all possible 4096 byte blocks, you need a 4096 byte address. I suppose we would expect the actual number of blocks computed and reused to be a sparse subset.

Alternately, have you considered 8 byte blocks?

If your block pointers are 8-byte addresses, you don't need to count on block sparsity, in fact, you don't even need to have the actual blocks.

A pointer type, that implements self-read and writes, with null allocations and deletes, is easy to implement incredibly efficiently in any decent type system. A true zero-cost abstraction, if I have ever seen one!

(On a more serious note, a memory heap and CPU that cooperated to interpret pointers with the top bit set, as a 63-bit linear-access/write self-storage "pointer", is an interesting thought.)

nine_k · 2025-05-22T04:19:02 1747887542

If some blocks are highly repetitive, this may make sense.

It's basically how deduplication works in ZFS. And that's why it only makes sense when you store a lot of repetitive data, e.g. VM images.

whatever1 · 2025-05-22T04:36:21 1747888581

We know for a fact that when we disable the cache of the processors their performance plummets, so the question is how much of computation is brand new computation (never seen before)?

vlovich123 · 2025-05-22T06:04:20 1747893860

While true, a small technical nitpick is that the cache also contains data that’s previously been loaded and reused, not just as a result of a previous computation (eg your executable program itself or a file being processed are examples)

Traubenfuchs · 2025-05-22T06:14:28 1747894468

you might be interested in pifs

https://github.com/philipl/pifs

EGreg · 2025-05-22T01:27:03 1747877223

You’re not wrong

Using an LLM and caching eg FAQs can save a lot of token credits

AI is basically solving a search problem and the models are just approximations of the data - like linear regression or Fourier transforms.

The training is basically your precalculation. The key is that it precalculates a model with billions of parameters, not overfitting with an exact random set of answers hehe

walterbell · 2025-05-22T09:30:58 1747906258

> Using an LLM and caching eg FAQs can save a lot of token credits

Do LLM providers use caches for FAQs, without changing the number of tokens billed to customer?

EGreg · 2025-05-22T19:29:12 1747942152

No, why would they. You are supposed to maintain that cache.

What I really want to know is about caching the large prefixes for prompts. Do they let you manage this somehow? What about llama and deepseek?

chowells · 2025-05-21T22:12:51 1747865571

Oh, that's not a problem. Just cache the retrieval lookups too.

michaelhoney · 2025-05-21T23:20:25 1747869625

it's pointers all the way down

drob518 · 2025-05-22T01:13:31 1747876411

Just add one more level of indirection, I always say.

EGreg · 2025-05-22T01:28:44 1747877324

But seriously… the solution is often to cache / shard to a halfway point — the LLM model weights for instance — and then store that to give you a nice approximation of the real problem space! That’s basically what many AI algorithms do, including MCTS and LLMs etc.

mncharity · 2025-05-22T00:39:36 1747874376

> if we were centrally storing all of the operations

Community-scale caching? That's basically what pre-compiled software distributions are. And one idea for addressing the programming language design balk "that would be a nice feature, but it's not known how to compile it efficiently, so you can't have it", is highly-parallel cloud compilation, paired with a community-scale compiler cache. You might not mind if something takes say a day to resolve, if the community only needs it run once per release.

20after4 · 2025-05-22T03:30:49 1747884649

Community scale cache, sounds like a library (the bricks and mortar kind)

handsclean · 2025-05-22T08:14:34 1747901674

https://conwaylife.com/wiki/HashLife is an algorithm for doing basically this in Conway’s Game of Life, which is Turing complete. I remember my first impression being complete confusion: here’s a tick-by-tick simulation too varied and complex to encapsulate in a formula, and you’re telling me I can just skip way into its future?

RetroTechie · 2025-05-22T14:17:42 1747923462

If I read that page correctly, it does this for areas with empty space between them?

Makes sense. Say you have a pattern (surrounded by empty space) that 'flickers': A-B-A-B-A... etc. Then as long as nothing intrudes, the nth generation is the same pattern as the (n+1,000,000)th generation. Similar for patterns that do a 3-cycle, 4-cycle etc.

All you'd need is a) a way to detect repeating patterns, and b) do some kind of collision detection between areas/patterns (there's a thing called 'lightspeed' in Life, that helps).
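Part a) could look something like this sketch. It only catches a pattern repeating exactly in place, and it ignores the collision detection in b) by simulating a single isolated pattern:

```python
def step(live):
    """One Life generation on an unbounded grid; live is a frozenset of (x, y)."""
    counts = {}
    for x, y in live:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    cell = (x + dx, y + dy)
                    counts[cell] = counts.get(cell, 0) + 1
    return frozenset(c for c, n in counts.items()
                     if n == 3 or (n == 2 and c in live))

def generation(start, n):
    """State after n generations, skipping ahead once a repeat is seen."""
    seen = {}            # state -> generation at which it first appeared
    history = []         # history[t] is the state at generation t
    state = start
    for t in range(n + 1):
        if state in seen:                     # cycle found
            first = seen[state]
            period = t - first
            return history[first + (n - first) % period]
        seen[state] = t
        history.append(state)
        state = step(state)
    return history[n]

blinker = frozenset({(0, 0), (1, 0), (2, 0)})
print(generation(blinker, 1_000_000) == blinker)          # True
print(generation(blinker, 1_000_001) == step(blinker))    # True
```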

handsclean · 2025-05-23T17:22:15 1748020935

I don’t fully understand the algorithm, but no, to my understanding it’s much more general than that. In each tick a cell’s state is solely determined by its immediate neighbors, which means the simulation has a “speed of light” of 1 cell/tick: to look N ticks into the future, you need only consider cells within N cells of the area you’re computing, no matter what’s outside that. So, for example, if you want to skip a 10x10 area 100 ticks into the future, you consider a 210x210 area centered on your 10x10, compute it once, then in the future use that 210x210 area as a lookup key for the 10x10 100 ticks into the future. I think HashLife is also somehow doing this on multiple scales at once, and some other tricks.
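Here is a sketch of just that lookup-key idea. It is not HashLife itself (no multiple scales, none of the other tricks), and the function names are mine. Each tick the known region shrinks by one ring, because the outermost cells depend on neighbours outside the patch:

```python
from functools import lru_cache

def step_shrink(grid):
    """Advance one tick and drop the outer ring, whose neighbours are unknown."""
    n = len(grid)
    out = []
    for y in range(1, n - 1):
        row = []
        for x in range(1, n - 1):
            neighbours = sum(grid[y + dy][x + dx]
                             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                             if dx or dy)
            alive = neighbours == 3 or (neighbours == 2 and grid[y][x] == 1)
            row.append(1 if alive else 0)
        out.append(tuple(row))
    return tuple(out)

@lru_cache(maxsize=None)
def centre_after(grid, ticks):
    """Centre of a square grid after `ticks` steps; the grid is the cache key."""
    for _ in range(ticks):
        grid = step_shrink(grid)
    return grid

def padded_side(inner_side, ticks):
    """Side length of the patch needed to know inner_side cells after ticks."""
    return inner_side + 2 * ticks

print(padded_side(10, 100))        # 210

# Small demo: a 7x7 patch holding a blinker fixes its 3x3 centre 2 ticks ahead.
patch = tuple(tuple(1 if (y == 3 and 2 <= x <= 4) else 0 for x in range(7))
              for y in range(7))
print(centre_after(patch, 2))      # ((0, 0, 0), (1, 1, 1), (0, 0, 0))
centre_after(patch, 2)             # same key again: served from the cache
print(centre_after.cache_info().hits)   # 1
```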

jsnider3 · 2025-05-22T02:46:38 1747881998

> In fact I don’t think we would need processors anymore if we were centrally storing all of the operations ever done in our processors.

On my way to memoize your search history.