
I liked that you linked to renting a dual 24GB GPU setup for $0.60/hour, but how long would it take to fine-tune a 70B model using your system (4 bits for weights)?

If I were a consumer, I would be interested in the final price of fine-tuning: for example, a table with model size, training set size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technology to the recent (-1, 0, 1) encoding? I expect you will answer that the (-1, 0, 1) model is not available and you can't try it, but my question is whether, once/if that model is available, answer.ai will be able to use the same technology as in this post to fine-tune a big model on two very small GPUs. Then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the market for LLM applications. This is great work, and along the lines of the book "From Zero to One" (not that I like or dislike the author): solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.




> Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.

If you use Stylus (or any similar browser extension), I actually wrote a style to hide points for that very reason, replacing karma and scores with `•••`

This is actually the second time I've seen someone mention this need, so I've made it into a gist and published it to userstyles, but here it is as well since it's pretty short:

    @-moz-document domain("news.ycombinator.com") {
        /* Hide karma and points on replies */
        span.pagetop #karma, span.comhead span.score {
            visibility: hidden;
            position: relative;
            display: inline-block;
            height: 10px !important;
            overflow: hidden;
        }
        span.pagetop #karma {
            width: 0.8rem !important;
        }
        span.comhead span.score {
            width: 0.8rem !important;
        }
        span.pagetop #karma::before, span.comhead span.score::before {
            content: "•••";
            visibility: visible;
            overflow: hidden;
            opacity: 0.8;
            font-family: Helvetica, Arial, sans-serif !important;
        }
    }

https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...

https://userstyles.world/style/15164/hackernews-hide-karma-a...


I wish this were built in, but I understand why it isn't: the points are an intentional, abusive psychological exploit.


On how long: fine-tuning time is influenced by your dataset size (more = slower), sequence length (since attention is O(N^2)), data movement, etc., and, most important, by how many steps you want to take. For QLoRA, some runs can do a few hundred steps, which can complete in minutes to an hour. Too many steps can overfit. So being able to fit it on consumer GPUs can be very cost effective.
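
For a rough sense of what such a run involves, here is a minimal single-process sketch of just the QLoRA side (not the FSDP sharding from the post), assuming the Hugging Face transformers/peft/bitsandbytes stack; the model name, LoRA rank and target modules are placeholders, not Answer.AI's actual configuration:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 base weights, as in QLoRA
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_name = "meta-llama/Llama-2-70b-hf"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Small trainable LoRA adapters on top of the frozen 4-bit base weights
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # ...then run a few hundred optimizer steps with your trainer of choice.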

On the 1.58bit paper, from what I understand, this requires a total retraining from scratch. Hopefully the researchers will open source their weights :)

On the technicals: weights are encoded in (-1, 0, 1), whilst QLoRA uses a 4-bit dynamic mapping of 16 numbers. The only change required would be the torch.matmul(X, W) step, which would become torch.bitlinear_matmul(X, W). Before, with QLoRA, one has to do torch.matmul(X, dequantize(W)). So one has to implement torch.bitlinear_matmul. The backward pass is torch.bitlinear_matmul(dY, W.T).
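
torch.bitlinear_matmul doesn't exist today; a toy reference of what such an op would compute, written in plain PyTorch (so it only emulates the add-only idea rather than actually speeding anything up), might look like this:

    import torch

    def bitlinear_matmul_reference(X, W_ternary, scale=1.0):
        # W_ternary holds only {-1, 0, +1}: each output element is just a
        # signed sum of activations, so no true multiplications are needed.
        # A real kernel would do the select-and-add in hardware; this only
        # emulates the result for checking correctness.
        pos = X @ (W_ternary == 1).to(X.dtype)    # add activations where weight == +1
        neg = X @ (W_ternary == -1).to(X.dtype)   # subtract where weight == -1
        return (pos - neg) * scale                # per-tensor scale from quantization

    X = torch.randn(2, 4)
    W = torch.tensor([[ 1, 0, -1],
                      [ 0, 1,  1],
                      [-1, 1,  0],
                      [ 1, 0,  1]])
    assert torch.allclose(bitlinear_matmul_reference(X, W), X @ W.to(X.dtype))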


What's the magic in 1.58-bit vs. 4-bit that makes it so much more efficient (as claimed)?


From what I understand, using (-1, 0, 1) removes multiplications on GPUs. I.e. assume you have a weight matrix and multiply it by some activations:

                   [-1,  0,  1]
    [10, 20, 30] x [ 0,  1, -1]
                   [ 1,  1,  0]
Instead of doing 10(-1) + 20(0) + 30(1) for the first output (and so on for the other columns), since we know beforehand the weights are only (-1, 0, 1), we can skip the multiply entirely: if the weight is -1, subtract the activation; if 0, skip it; if 1, add it. The hardware only ever does additions and subtractions.
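
Spelled out for the toy example above, in plain Python with nothing but adds, subtracts and skips:

    activations = [10, 20, 30]
    weights = [[-1, 0,  1],   # same ternary matrix as above
               [ 0, 1, -1],
               [ 1, 1,  0]]

    outputs = []
    for col in range(3):
        total = 0
        for row in range(3):
            w = weights[row][col]
            a = activations[row]
            if w == 1:        # weight +1: add the activation
                total += a
            elif w == -1:     # weight -1: flip the sign, i.e. subtract
                total -= a
            # weight 0: skip entirely, nothing to compute
        outputs.append(total)

    print(outputs)  # [20, 50, -10]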

Floating point multiplication adds the exponents and multiplies the mantissas. So, just simplifying (treat multiplier cost as roughly M^2 and adder cost as roughly the bit width):

Float16 has E=5, M=10, so roughly 5 + 10^2 = 105 units of space.

Bfloat16 has E=8, M=7, so 8 + 7^2 = 57.

Float8 (E4M3) has E=4, M=3, so 4 + 3^2 = 13.

1.58-bit with 16-bit (E=5, M=10) activations: addition only, so roughly 5 + 10 = 15.

1.58-bit with 8-bit (E=4, M=3) activations: addition only, so roughly 4 + 3 = 7.

Obviously I'm simplifying, but with only additions, 1.58-bit uses say 7 units of space whilst FP8 uses 13, so in theory ~2x more compute units can be crammed into the same transistor budget, i.e. ~2x more FLOPs than FP8.
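
Plugging the same back-of-the-envelope numbers into a few lines of Python (this is only the rough heuristic from above, multiplier ~ E + M^2 vs. add-only ~ E + M, not a real hardware model):

    # Rough "area" heuristic from the comment above, purely illustrative.
    formats = {
        "float16 (E5M10)": (5, 10),
        "bfloat16 (E8M7)": (8, 7),
        "float8 (E4M3)":   (4, 3),
    }

    for name, (E, M) in formats.items():
        multiply_area = E + M * M   # conventional multiply: exponent add + mantissa multiply
        add_only_area = E + M       # 1.58-bit weights: sign flip + add only
        print(f"{name}: multiply ~ {multiply_area}, add-only ~ {add_only_area}")
    # float16 (E5M10): multiply ~ 105, add-only ~ 15
    # bfloat16 (E8M7): multiply ~ 57, add-only ~ 15
    # float8 (E4M3): multiply ~ 13, add-only ~ 7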


Really simple explanation: for inference, feed-forward networks are threshold circuits, and by their nature ANNs have binary outputs, outputting true and false (same as being a threshold circuit).

So if you train your models with that in mind, your weights can be reduced to (-1, 0, 1), reducing the space complexity.

I don't think the costs in expressiveness are fully captured yet, but since perplexity doesn't care about correctness, if that is the metric that matters to you, it will probably reduce memory requirements for inference.


Also, just to add: I think the 1.58-bit approach is mostly faster for inference, because training still has to multiply a lot of floating point gradients by integer activations, hold floating point weights/gradients for rounding, and deal with norms and such. Could be wrong about that though.


> Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.

The irony of making an unnecessary edit like this to virtue signal for implicit social currency by shitting on the explicit form.


As mentioned in the post, benchmarking results are coming in a later post. But in short: you can train an epoch of Alpaca in 24 hours or so, which is enough to get a very significant change in model behavior.


> the recent (-1,0,1) encoding?

A side point, but this "recent" encoding goes back to a 2017 paper from the Allen Institute. These days, a seven-year-old paper is ancient.

They went further and showed you could get away with binary; you don't even need ternary!


Goes back before then. This got popularized by BinaryConnect in 2015, and groups were training binary networks as early as 2011.

You are probably referring to XNOR-Net, and the novel piece there was also using binary activations (which bitnet does not do).

So as far as I can tell, bitnet is basically BinaryConnect applied to LLMs.

https://arxiv.org/abs/1511.00363


Thanks for your informative comment. What HN is for!


The bitnet paper was showing worse results than an fp16 transformer with the same parameter count. The shocking result in the 1.58b paper (same group) is no quality loss compared to fp16.


I think those tables could be a fascinating product. All parties involved could purchase them for private and public use.

P.S. I thought one was supposed to spend the HN points on mocking North Americans, shameless self-promotion, unpopular facts, general trolling, and complaints about topics existing. I could go on, but I haven't the points.


I like how you think about social media.



