
I wonder what information theory can tell us here.


The entropy of English is ~1 bit per letter (measured in a funny way by Shannon: https://cs.stanford.edu/people/eroberts/courses/soco/project...)
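For intuition on where numbers like that come from, here's the zeroth-order version (a minimal sketch; Shannon's ~1 bit figure comes from humans predicting the next letter in context, which a plain frequency count can't capture, so this lands closer to 4 bits):

    import math
    from collections import Counter

    def unigram_entropy(text):
        """Bits per letter under a zeroth-order (letter-frequency-only) model.
        For English this comes out near 4 bits; getting down toward 1 bit
        requires conditioning on context, as in Shannon's experiments."""
        letters = [c for c in text.lower() if c.isalpha()]
        counts = Counter(letters)
        n = len(letters)
        return -sum((k / n) * math.log2(k / n) for k in counts.values())

    # "corpus.txt" is a placeholder for any large English text file
    print(unigram_entropy(open("corpus.txt").read()))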

In general, it's a bit funny how we build ML models. You take a bucket of matrices, fill them with random numbers, and then start to introduce "biases" through backpropagation. A model's loss can converge, but most of the matrices are still filled with random noise.

Binary weights are somewhat obvious in retrospect. Weights indicate the strength of an association between two neurons. Intuitively, most of the value is probably just that an association between two neurons exists, and whether that association is positive or negative.
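That intuition is pretty much how the standard binarization recipes work: keep the sign of each weight plus one scale per tensor, throw away the magnitudes. A sketch in the BinaryConnect/XNOR-Net style (the mean-|w| scale is that lineage's choice, not necessarily this paper's scheme):

    import numpy as np

    def binarize(w):
        """Keep only 'does an association exist, and in which direction':
        the sign carries the association, one scalar preserves magnitude."""
        alpha = np.abs(w).mean()              # per-tensor scale (XNOR-Net choice)
        return alpha * np.where(w >= 0, 1.0, -1.0)

    w = np.random.randn(4, 4)
    print(binarize(w))                        # entries are all +/- alpha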


Wow, fascinating read. From 1999!

Would a monkey who knew the n-gram frequencies of the letters in English where n is large be able to produce credible English text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one door and a human is behind the other, could a third-party observer tell which was the monkey? This question rings of the Turing test for artificial intelligence, to which there is no easy answer.

No easy answer indeed!
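For what it's worth, the monkey is trivial to build, which is part of why the question bites. A minimal character-level sketch (the corpus filename is a placeholder):

    import random
    from collections import defaultdict

    def train(text, n):
        """n-gram model: map each (n-1)-letter context to its observed next letters."""
        model = defaultdict(list)
        for i in range(len(text) - n):
            model[text[i:i + n - 1]].append(text[i + n - 1])
        return model

    def babble(model, context, length=300):
        """The monkey: repeatedly sample a next letter given the last n-1 letters."""
        out = context
        while len(out) < length and context in model:
            out += random.choice(model[context])
            context = out[-len(context):]
        return out

    corpus = open("english_corpus.txt").read()  # placeholder: any large English text
    model = train(corpus, n=8)                  # "where n is large"
    print(babble(model, corpus[:7]))            # seed with the first 7 letters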


Down to one bit, yes, but that's taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the nonlinear activation function is still essential (it's what avoids the linear, perceptron-style limitation), but the weight scaling can be this simple, at least for LLMs.

It makes me wonder whether the extra layers are effectively compensating; in other words, could the number of layers or hidden neurons be trimmed down if we gave each weight more bits, and would we still see equivalent effectiveness?
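One back-of-envelope way to frame that trade (my framing, nothing from the paper): hold total weight bits constant and see how much wider or deeper the low-bit model has to get.

    # Crude capacity accounting: total bits ~= n_params * bits_per_weight.
    # To match a 16-bit model at 1 bit per weight you need ~16x the parameters,
    # which is ~4x the hidden width (dense-layer params scale with width^2)
    # or ~16x the layers at fixed width.
    def params_to_match(base_params, base_bits, new_bits):
        return base_params * base_bits // new_bits

    print(params_to_match(7_000_000_000, 16, 1))  # ~112B 1-bit params vs 7B fp16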


You can just delete whole layers, because the point of residual connections is to let the model learn the "layer count" hyperparameter automatically. Compare this with a network without residual connections, where the model must use every layer: there you would have to get the layer count exactly right, which isn't really possible, since each data point might benefit from a different effective depth. The extra layers therefore exist primarily so the model becomes robust to a poorly chosen depth hyperparameter. You still need some minimum number of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887
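The mechanism is easy to see in code: a residual block computes x + f(x), so skipping the block degrades to the identity instead of severing the forward path. A minimal PyTorch-style sketch (module and argument names are mine):

    import torch
    import torch.nn as nn

    class ResidualMLP(nn.Module):
        def __init__(self, dim, n_layers):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                for _ in range(n_layers)
            )

        def forward(self, x, skip=frozenset()):
            for i, block in enumerate(self.blocks):
                if i in skip:
                    continue           # a deleted block is just the identity
                x = x + block(x)       # residual: the block only learns a correction
            return x

    model = ResidualMLP(dim=64, n_layers=12)
    x = torch.randn(1, 64)
    out_full = model(x)
    out_pruned = model(x, skip={9, 10})  # drop blocks post hoc; the linked paper studies this on LLMs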


I'm surprised that 1-bit works. Ternary (-1, 0, 1) makes sense, but if you only have 0 and 1, the asymmetry ought to be a problem. Or is multiplication an XOR, so 1x1 = 0?


It seems like they have learned floating-point parameters x and y so that dequantized(bit 0) = x and dequantized(bit 1) = y. Thus there is no built-in asymmetry. Or, more precisely, they learned a zero point and a scale, but that's equivalent to this simpler model in the 1-bit case.

It still seems like there would be a problem because either [x, y] looks like [0ish, 1ish] and you can't have negative weights, or [x, y] looks like [-1ish, 1ish] and you can't have "don't care" weights. But if you have some redundancy in your neurons I guess this is acceptable because you can just cancel out the positive contribution from a neuron you don't care about with a negative contribution from a very similar neuron that you also don't care about.
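Concretely, the affine dequantization described above (a sketch of the generic zero-point-and-scale scheme, not necessarily this paper's exact parameterization):

    import numpy as np

    # w ~= scale * q + zero_point, with q in {0, 1}.
    # Equivalently dequant(0) = x and dequant(1) = y, where x = zero_point
    # and y = zero_point + scale: [x, y] can straddle zero ([-1ish, 1ish])
    # or not ([0ish, 1ish]), but never both at once, as noted above.
    def dequantize(bits, scale, zero_point):
        return scale * bits + zero_point

    q = np.array([0, 1, 1, 0])
    print(dequantize(q, scale=2.0, zero_point=-1.0))  # [-1.  1.  1. -1.]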


It's more like: addition is an XOR. (In fact, {0, 1} with AND as multiplication and XOR as addition is GF(2).) Throw in NOT (or, really, just a constant 1) and you can compute any circuit.

Biologically, inhibitory neurons are every bit as important as excitatory ones, so if you squint just right, XOR looks like a neuron's activation being inhibited by another presynaptic neuron.
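The GF(2) claim is easy to brute-force: with AND as multiplication and XOR as addition, the field laws hold on {0, 1} (a quick check):

    from itertools import product

    add = lambda a, b: a ^ b   # XOR as addition in GF(2)
    mul = lambda a, b: a & b   # AND as multiplication in GF(2)

    # Distributivity a*(b+c) == a*b + a*c, plus the two identities.
    assert all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
               for a, b, c in product((0, 1), repeat=3))
    assert all(add(0, a) == a and mul(1, a) == a for a in (0, 1))
    print("{0,1} with XOR/AND satisfies the field laws")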


There's a scale and bias applied afterwards, so it's not necessarily asymmetric.


Yeah, basically. In binary, multiplication is an XNOR operation:

    0 XNOR 0 = 1
    0 XNOR 1 = 0
    1 XNOR 0 = 0
    1 XNOR 1 = 1
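That table is sign multiplication in disguise: read bit b as the value 2b - 1 (so 0 stands for -1 and 1 for +1), and XNOR on bits becomes ordinary multiplication on signs. A quick check:

    # Map bit -> sign: 0 -> -1, 1 -> +1; then XNOR on bits == multiply on signs.
    to_sign = lambda b: 2 * b - 1
    xnor = lambda a, b: 1 - (a ^ b)

    for a in (0, 1):
        for b in (0, 1):
            assert to_sign(xnor(a, b)) == to_sign(a) * to_sign(b)
    print("XNOR on {0,1} is multiplication on {-1,+1}")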


XNOR does not distribute over AND or any other standard binary operation (try 0 XNOR (0 AND 1) versus (0 XNOR 0) AND (0 XNOR 1)), so it's not really multiplication in a way that's useful.
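Brute-forcing that counterexample over all inputs (a quick check):

    from itertools import product

    xnor = lambda a, b: 1 - (a ^ b)

    # Where does a XNOR (b AND c) differ from (a XNOR b) AND (a XNOR c)?
    violations = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
                  if xnor(a, b & c) != (xnor(a, b) & xnor(a, c))]
    print(violations)  # (0, 0, 1) is one: the left side gives 1, the right side 0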


0*0 = 1? I've always heard binary multiplication described as AND.


For balanced binary [-1, 1], yes: -1 * -1 = 1.


That makes sense, thanks!


Neural networks work, but why they work is not well understood. Researchers continue to find new "free lunches" all the time.



