
I liked that you linked to renting a dual 24GB GPU setup for $0.60/hour, but how long would it take to fine-tune a 70B model using your system (4 bits for weights)?

If I were a consumer, I would be interested in the final price of fine-tuning: for example, a table with model size, training set size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technology to the recent (-1, 0, 1) encoding? I expect you will answer that the (-1, 0, 1) model is not available and you can't try it, but my question is whether, once/if that model is available, answer.ai will be able to use the same technology as in this post to fine-tune a big model on two very small GPUs. Then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the market for LLM applications. This is great work, and along the lines of the book "From Zero to One" (not that I like or dislike the author): solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.




> Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.

If you use Stylus (or any similar browser extension), I actually wrote a style to hide points for that very reason, replacing karma and scores with `•••`

This is actually the second time I've seen someone mention this need, so I've made it into a gist and published it to userstyles, but here it is as well since it's pretty short:

    @-moz-document domain("news.ycombinator.com") {
        /* Hide karma and points on replies */
        span.pagetop #karma, span.comhead span.score {
            visibility: hidden;
            position: relative;
            display: inline-block;
            height: 10px !important;
            overflow: hidden;
        }
        span.pagetop #karma {
            width: 0.8rem !important;
        }
        span.comhead span.score {
            width: 0.8rem !important;
        }
        span.pagetop #karma::before, span.comhead span.score::before {
            content: "•••";
            visibility: visible;
            overflow: hidden;
            opacity: 0.8;
            font-family: Helvetica, Arial, sans-serif !important;
        }
    }

https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...

https://userstyles.world/style/15164/hackernews-hide-karma-a...


I wish this were built in, but I understand why it isn't: the points are an intentional, abusive psychological exploit.


On how long: fine-tuning time is influenced by your dataset size (more = slower), sequence length (since attention is O(N^2)), data movement, etc., and, most important, by how many steps you want to take. For QLoRA, some runs can do a few hundred steps, which can complete in minutes to an hour. Too many steps can overfit. So being able to fit it on consumer GPUs can be very cost effective.
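
For a rough sense of what such a run involves, here is a minimal single-process sketch of just the QLoRA side (not the FSDP sharding from the post), assuming the Hugging Face transformers/peft/bitsandbytes stack; the model name, LoRA rank and target modules are placeholders, not Answer.AI's actual configuration:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 base weights, as in QLoRA
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_name = "meta-llama/Llama-2-70b-hf"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Small trainable LoRA adapters on top of the frozen 4-bit base weights
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # ...then run a few hundred optimizer steps with your trainer of choice.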

On the 1.58bit paper, from what I understand, this requires a total retraining from scratch. Hopefully the researchers will open source their weights :)

On the technicals: weights are encoded in (-1, 0, 1), whilst QLoRA uses a 4-bit dynamic mapping of 16 numbers. The only change required would be the torch.matmul(X, W) step, which would become torch.bitlinear_matmul(X, W). Before, with QLoRA, one has to do torch.matmul(X, dequantize(W)). So one has to implement torch.bitlinear_matmul. The backward pass is torch.bitlinear_matmul(dY, W.T).
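
torch.bitlinear_matmul doesn't exist today; a toy reference of what such an op would compute, written in plain PyTorch (so it only emulates the add-only idea rather than actually speeding anything up), might look like this:

    import torch

    def bitlinear_matmul_reference(X, W_ternary, scale=1.0):
        # W_ternary holds only {-1, 0, +1}: each output element is just a
        # signed sum of activations, so no true multiplications are needed.
        # A real kernel would do the select-and-add in hardware; this only
        # emulates the result for checking correctness.
        pos = X @ (W_ternary == 1).to(X.dtype)    # add activations where weight == +1
        neg = X @ (W_ternary == -1).to(X.dtype)   # subtract where weight == -1
        return (pos - neg) * scale                # per-tensor scale from quantization

    X = torch.randn(2, 4)
    W = torch.tensor([[ 1, 0, -1],
                      [ 0, 1,  1],
                      [-1, 1,  0],
                      [ 1, 0,  1]])
    assert torch.allclose(bitlinear_matmul_reference(X, W), X @ W.to(X.dtype))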


What's the magic in 1.58-bit vs. 4-bit that makes it so much more efficient (as claimed)?


From what I understand, using (-1, 0, 1) removes multiplications on GPUs. I.e. assume you have a weight matrix and multiply it by some activations:

                   [-1,  0,  1]
    [10, 20, 30] x [ 0,  1, -1]
                   [ 1,  1,  0]
Instead of doing 10(-1) + 20(0) + 30(1) for the first output (and so on for the other columns), since we know beforehand the weights are only (-1, 0, 1), we can skip the multiply entirely: if the weight is -1, subtract the activation; if 0, skip it; if 1, add it. The hardware only ever does additions and subtractions.
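
Spelled out for the toy example above, in plain Python with nothing but adds, subtracts and skips:

    activations = [10, 20, 30]
    weights = [[-1, 0,  1],   # same ternary matrix as above
               [ 0, 1, -1],
               [ 1, 1,  0]]

    outputs = []
    for col in range(3):
        total = 0
        for row in range(3):
            w = weights[row][col]
            a = activations[row]
            if w == 1:        # weight +1: add the activation
                total += a
            elif w == -1:     # weight -1: flip the sign, i.e. subtract
                total -= a
            # weight 0: skip entirely, nothing to compute
        outputs.append(total)

    print(outputs)  # [20, 50, -10]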

Floating point multiplication adds the exponents and multiplies the mantissas. So, just simplifying (treat multiplier cost as roughly M^2 and adder cost as roughly the bit width):

Float16 has E=5, M=10, so roughly 5 + 10^2 = 105 units of space.

Bfloat16 has E=8, M=7, so 8 + 7^2 = 57.

Float8 (E4M3) has E=4, M=3, so 4 + 3^2 = 13.

1.58-bit with 16-bit (E=5, M=10) activations: addition only, so roughly 5 + 10 = 15.

1.58-bit with 8-bit (E=4, M=3) activations: addition only, so roughly 4 + 3 = 7.

Obviously I'm simplifying, but with only additions, 1.58-bit uses say 7 units of space whilst FP8 uses 13, so in theory ~2x more compute units can be crammed into the same transistor budget, i.e. ~2x more FLOPs than FP8.
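
Plugging the same back-of-the-envelope numbers into a few lines of Python (this is only the rough heuristic from above, multiplier ~ E + M^2 vs. add-only ~ E + M, not a real hardware model):

    # Rough "area" heuristic from the comment above, purely illustrative.
    formats = {
        "float16 (E5M10)": (5, 10),
        "bfloat16 (E8M7)": (8, 7),
        "float8 (E4M3)":   (4, 3),
    }

    for name, (E, M) in formats.items():
        multiply_area = E + M * M   # conventional multiply: exponent add + mantissa multiply
        add_only_area = E + M       # 1.58-bit weights: sign flip + add only
        print(f"{name}: multiply ~ {multiply_area}, add-only ~ {add_only_area}")
    # float16 (E5M10): multiply ~ 105, add-only ~ 15
    # bfloat16 (E8M7): multiply ~ 57, add-only ~ 15
    # float8 (E4M3): multiply ~ 13, add-only ~ 7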


Really simple explanation: for inference, feed-forward networks are threshold circuits, and by their nature ANNs have binary outputs, outputting true and false (same as being a threshold circuit).

So if you train your models with that in mind, your weights can be reduced to (-1, 0, 1), reducing the space complexity.

I don't think the costs in expressiveness are fully captured yet, but since perplexity doesn't care about correctness, if that is the metric that matters to you, it will probably reduce memory requirements for inference.


Also, just to add: I think the 1.58-bit approach is mostly faster for inference, because training still has to multiply a lot of floating point gradients by integer activations, hold floating point weights/gradients for rounding, and deal with norms and such. Could be wrong about that though.


> Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and then, some day, create a new presence on HN again.

The irony of making an unnecessary edit like this to virtue signal for implicit social currency by shitting on the explicit form.


As mentioned in the post, benchmarking results are coming in a later post. But in short: you can train an epoch of Alpaca in 24 hours or so, which is enough to get a very significant change in model behavior.


> the recent (-1,0,1) encoding?

A side point, but this "recent" encoding goes back to a 2017 paper from the Allen Institute. These days, a seven-year-old paper is ancient.

They went further and showed you could get away with binary; you don't even need ternary!


Goes back before then. This got popularized by BinaryConnect in 2015, and groups were training binary networks as early as 2011.

You are probably referring to XNOR-Net, and the novel piece there was also using binary activations (which bitnet does not do).

So as far as I can tell, bitnet is basically BinaryConnect applied to LLMs.

https://arxiv.org/abs/1511.00363


Thanks for your informative comment. What HN is for!


The bitnet paper was showing worse results than an fp16 transformer with the same parameter count. The shocking result in the 1.58b paper (same group) is no quality loss compared to fp16.


I think those tables could be a fascinating product. All parties involved could purchase them for private and public use.

P.S. I thought one was supposed to spend the HN points on mocking North Americans, shameless self-promotion, unpopular facts, general trolling, and complaints about topics existing. I could go on, but I haven't the points.


I like how you think about social media.



