
I wonder what information theory can tell us here.


The entropy of English is ~1 bit per letter (measured in a funny way by Shannon: https://cs.stanford.edu/people/eroberts/courses/soco/project...)
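For intuition on where numbers like that come from, here's the zeroth-order version (a minimal sketch; Shannon's ~1 bit figure comes from humans predicting the next letter in context, which a plain frequency count can't capture, so this lands closer to 4 bits):

    import math
    from collections import Counter

    def unigram_entropy(text):
        """Bits per letter under a zeroth-order (letter-frequency-only) model.
        For English this comes out near 4 bits; getting down toward 1 bit
        requires conditioning on context, as in Shannon's experiments."""
        letters = [c for c in text.lower() if c.isalpha()]
        counts = Counter(letters)
        n = len(letters)
        return -sum((k / n) * math.log2(k / n) for k in counts.values())

    # "corpus.txt" is a placeholder for any large English text file
    print(unigram_entropy(open("corpus.txt").read()))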

In general, it's a bit funny how we build ML models. You take a bucket of matrices, fill them with random numbers, and then start to introduce "biases" through backpropagation. A model's loss can converge, but most of the matrices are still filled with random noise.

Binary weights are somewhat obvious in retrospect. Weights indicate the strength of an association between two neurons. Intuitively, most of the value is probably just that an association between two neurons exists, and whether that association is positive or negative.
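That intuition is pretty much how the standard binarization recipes work: keep the sign of each weight plus one scale per tensor, throw away the magnitudes. A sketch in the BinaryConnect/XNOR-Net style (the mean-|w| scale is that lineage's choice, not necessarily this paper's scheme):

    import numpy as np

    def binarize(w):
        """Keep only 'does an association exist, and in which direction':
        the sign carries the association, one scalar preserves magnitude."""
        alpha = np.abs(w).mean()              # per-tensor scale (XNOR-Net choice)
        return alpha * np.where(w >= 0, 1.0, -1.0)

    w = np.random.randn(4, 4)
    print(binarize(w))                        # entries are all +/- alpha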


Wow, fascinating read. From 1999!

Would a monkey who knew the n-gram frequencies of the letters in English where n is large be able to produce credible English text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one door and a human is behind the other, could a third-party observer tell which was the monkey? This question rings of the Turing test for artificial intelligence, to which there is no easy answer.

No easy answer indeed!
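For what it's worth, the monkey is trivial to build, which is part of why the question bites. A minimal character-level sketch (the corpus filename is a placeholder):

    import random
    from collections import defaultdict

    def train(text, n):
        """n-gram model: map each (n-1)-letter context to its observed next letters."""
        model = defaultdict(list)
        for i in range(len(text) - n):
            model[text[i:i + n - 1]].append(text[i + n - 1])
        return model

    def babble(model, context, length=300):
        """The monkey: repeatedly sample a next letter given the last n-1 letters."""
        out = context
        while len(out) < length and context in model:
            out += random.choice(model[context])
            context = out[-len(context):]
        return out

    corpus = open("english_corpus.txt").read()  # placeholder: any large English text
    model = train(corpus, n=8)                  # "where n is large"
    print(babble(model, corpus[:7]))            # seed with the first 7 letters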


Down to one bit, yes, but that's taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the nonlinear activation function is still essential (it's what avoids the linear, perceptron-style limitation), but the weight scaling can be this simple, at least for LLMs.

It makes me wonder whether the extra layers are effectively compensating; in other words, could the number of layers or hidden neurons be trimmed down if we gave each weight more bits, and would we still see equivalent effectiveness?
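One back-of-envelope way to frame that trade (my framing, nothing from the paper): hold total weight bits constant and see how much wider or deeper the low-bit model has to get.

    # Crude capacity accounting: total bits ~= n_params * bits_per_weight.
    # To match a 16-bit model at 1 bit per weight you need ~16x the parameters,
    # which is ~4x the hidden width (dense-layer params scale with width^2)
    # or ~16x the layers at fixed width.
    def params_to_match(base_params, base_bits, new_bits):
        return base_params * base_bits // new_bits

    print(params_to_match(7_000_000_000, 16, 1))  # ~112B 1-bit params vs 7B fp16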


You can just delete whole layers, because the point of residual connections is to let the model learn the "layer count" hyperparameter automatically. Compare this with a network without residual connections, where the model must use every layer: there you would have to get the layer count exactly right, which isn't really possible, since each data point might benefit from a different effective depth. The extra layers therefore exist primarily so the model becomes robust to a poorly chosen depth hyperparameter. You still need some minimum number of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887
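The mechanism is easy to see in code: a residual block computes x + f(x), so skipping the block degrades to the identity instead of severing the forward path. A minimal PyTorch-style sketch (module and argument names are mine):

    import torch
    import torch.nn as nn

    class ResidualMLP(nn.Module):
        def __init__(self, dim, n_layers):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                for _ in range(n_layers)
            )

        def forward(self, x, skip=frozenset()):
            for i, block in enumerate(self.blocks):
                if i in skip:
                    continue           # a deleted block is just the identity
                x = x + block(x)       # residual: the block only learns a correction
            return x

    model = ResidualMLP(dim=64, n_layers=12)
    x = torch.randn(1, 64)
    out_full = model(x)
    out_pruned = model(x, skip={9, 10})  # drop blocks post hoc; the linked paper studies this on LLMs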


I'm surprised that 1-bit works. Ternary (-1, 0, 1) makes sense, but if you only have 0 and 1, the asymmetry ought to be a problem. Or is multiplication an XOR, so 1x1 = 0?


It seems like they have learned floating-point parameters x and y so that dequantized(bit 0) = x and dequantized(bit 1) = y. Thus there is no built-in asymmetry. Or, more precisely, they learned a zero point and a scale, but that's equivalent to this simpler model in the 1-bit case.

It still seems like there would be a problem because either [x, y] looks like [0ish, 1ish] and you can't have negative weights, or [x, y] looks like [-1ish, 1ish] and you can't have "don't care" weights. But if you have some redundancy in your neurons I guess this is acceptable because you can just cancel out the positive contribution from a neuron you don't care about with a negative contribution from a very similar neuron that you also don't care about.
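Concretely, the affine dequantization described above (a sketch of the generic zero-point-and-scale scheme, not necessarily this paper's exact parameterization):

    import numpy as np

    # w ~= scale * q + zero_point, with q in {0, 1}.
    # Equivalently dequant(0) = x and dequant(1) = y, where x = zero_point
    # and y = zero_point + scale: [x, y] can straddle zero ([-1ish, 1ish])
    # or not ([0ish, 1ish]), but never both at once, as noted above.
    def dequantize(bits, scale, zero_point):
        return scale * bits + zero_point

    q = np.array([0, 1, 1, 0])
    print(dequantize(q, scale=2.0, zero_point=-1.0))  # [-1.  1.  1. -1.]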


It's more like: addition is an XOR. (In fact, {0, 1} with AND as multiplication and XOR as addition is GF(2).) Throw in NOT (or, really, just a constant 1) and you can compute any circuit.

Biologically, inhibitory neurons are every bit as important as excitatory ones, so if you squint just right, XOR looks like a neuron's activation being inhibited by another presynaptic neuron.
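The GF(2) claim is easy to brute-force: with AND as multiplication and XOR as addition, the field laws hold on {0, 1} (a quick check):

    from itertools import product

    add = lambda a, b: a ^ b   # XOR as addition in GF(2)
    mul = lambda a, b: a & b   # AND as multiplication in GF(2)

    # Distributivity a*(b+c) == a*b + a*c, plus the two identities.
    assert all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
               for a, b, c in product((0, 1), repeat=3))
    assert all(add(0, a) == a and mul(1, a) == a for a in (0, 1))
    print("{0,1} with XOR/AND satisfies the field laws")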


There's a scale and bias applied afterwards, so it's not necessarily asymmetric.


Yeah, basically. In binary, multiplication is an XNOR operation:

    0 XNOR 0 = 1
    0 XNOR 1 = 0
    1 XNOR 0 = 0
    1 XNOR 1 = 1
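That table is sign multiplication in disguise: read bit b as the value 2b - 1 (so 0 stands for -1 and 1 for +1), and XNOR on bits becomes ordinary multiplication on signs. A quick check:

    # Map bit -> sign: 0 -> -1, 1 -> +1; then XNOR on bits == multiply on signs.
    to_sign = lambda b: 2 * b - 1
    xnor = lambda a, b: 1 - (a ^ b)

    for a in (0, 1):
        for b in (0, 1):
            assert to_sign(xnor(a, b)) == to_sign(a) * to_sign(b)
    print("XNOR on {0,1} is multiplication on {-1,+1}")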


XNOR does not distribute over AND or any other standard binary operation (try 0 XNOR (0 AND 1) versus (0 XNOR 0) AND (0 XNOR 1)), so it's not really multiplication in a way that's useful.
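Brute-forcing that counterexample over all inputs (a quick check):

    from itertools import product

    xnor = lambda a, b: 1 - (a ^ b)

    # Where does a XNOR (b AND c) differ from (a XNOR b) AND (a XNOR c)?
    violations = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
                  if xnor(a, b & c) != (xnor(a, b) & xnor(a, c))]
    print(violations)  # (0, 0, 1) is one: the left side gives 1, the right side 0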


0*0 = 1? I've always heard binary multiplication described as AND.


For balanced binary [-1, 1], yes: -1 * -1 = 1.


That makes sense, thanks!


Neural networks work, but why they work is not well understood. Researchers continue to find new "free lunches" all the time.



