
For training how do you get any kind of meaningful derivative with it?


You don't (you have to use real-valued inertial 'latent weights' during training): https://arxiv.org/abs/1906.02107

There is still a reduction in memory usage, though, just not 24x:

> "Furthermore, Bop reduces the memory requirements during training: it requires only one real-valued variable per weight, while the latent-variable approach with Momentum and Adam require two and three respectively."


Maybe evolutionary algorithms instead? They haven't proven super useful historically, but maybe at the scale of enormous LLMs they will be?
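The simplest version would be something like a (1+1) evolution strategy directly over the sign bits. A purely illustrative sketch with a made-up objective (a real run would evaluate a network on a batch instead):

    import numpy as np

    rng = np.random.default_rng(0)
    target = rng.choice([-1.0, 1.0], size=1024)  # toy "task": recover a random sign pattern

    def loss(w):
        return -np.dot(w, target)

    w = rng.choice([-1.0, 1.0], size=1024)       # 1-bit weights stored as +/-1
    best = loss(w)
    for _ in range(20_000):
        flips = rng.random(w.shape) < 0.01       # mutate ~1% of the bits
        cand = np.where(flips, -w, w)
        c = loss(cand)
        if c <= best:                            # (1+1)-ES: keep the mutant if it's no worse
            w, best = cand, c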


Nope, they're orders of magnitude less efficient because they don't leverage gradients.

Rule of thumb in optimization: real numbers are easy, integers are hard


This may be the status quo because of the so called "hardware lottery" which has historically been optimized for floating point. I'm speculating, but if hardware designers were instead only concerned about raw xnor density and throughput, we might end up with chips powerful enough that giant 1-bit nets could be trained purely through evolution.


No, it's a fact at the mathematical level, one you can enshrine in big-O terms if you want to.


How do you optimize memory for floating point?


BF8 and other similar formats?
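Mostly it's just fewer bytes per parameter. Back-of-the-envelope for a hypothetical 7B-parameter model, weights only (ignoring activations and optimizer state):

    params = 7e9  # hypothetical 7B-parameter model
    for name, nbytes in [("fp32", 4), ("bf16", 2), ("fp8", 1)]:
        print(f"{name}: {params * nbytes / 1e9:.0f} GB of weights")
    # fp32: 28 GB, bf16: 14 GB, fp8: 7 GB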


Evolutionary algorithms made you, didn’t they?


That does not prove that they can beat gradient descent.


It took a lot of human brain flops to get to this point in time, though. I wonder how many orders of magnitude more that is than it took to train ChatGPT...


A gradient-directed evolutionary algorithm sounds kinda interesting.
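e.g. bias the bit-flip probabilities with a gradient estimate instead of flipping uniformly at random. Just a sketch of the idea, not something from the literature:

    import numpy as np

    rng = np.random.default_rng(0)

    def mutate(w, grad, base_p=0.001, boosted_p=0.05):
        # flip bits more often where the gradient "wants" the sign to change,
        # i.e. where sign(grad) agrees with sign(w), so descent would push w toward zero
        p = np.where(np.sign(grad) == np.sign(w), boosted_p, base_p)
        flips = rng.random(w.shape) < p
        return np.where(flips, -w, w)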


Maybe something probabilistic?


The OP explicitly excludes training.


The one I replied to said 1-bit for both training and inference.



