> Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.
Apologies if something I said (or I guess did not say...) offended you! It's a hypothetical, and one that in my experience is not so easy to achieve, but maybe you have different results. That's why I didn't want to comment on it: maybe it's possible, but LLMs don't scale up as gracefully under quantization as other networks like image classifiers, in my experience.
> The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.
To be clear, such hardware does not yet exist, and it's unclear whether you can really get more efficient binary/ternary matmul if you need high-precision accumulators and more frequent broadcast scales/shifts. It's again a complicated hardware question whether the total latency of doing many high-precision accumulations plus many scales/shifts ends up smaller (or, chip-area-wise, even feasible to implement) compared to a 4-bit baseline.
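To make the concern concrete, here is a minimal sketch (names, the `group_size` value, and the int8 activation assumption are all illustrative, not anyone's actual kernel) of a group-wise ternary dot product. The point is just that even with weights in {-1, 0, +1}, each group still needs a high-precision accumulator and a higher-precision scale, and the smaller the group, the more scales you pay per output element:

```python
import numpy as np

def ternary_dot(activations_int8, weights_ternary, group_scales, group_size=32):
    """Hypothetical sketch of a group-wise ternary dot product.

    activations_int8: int8 vector
    weights_ternary:  values in {-1, 0, +1}
    group_scales:     one float scale per group of `group_size` weights
    """
    total = 0.0
    for g, scale in enumerate(group_scales):
        lo, hi = g * group_size, (g + 1) * group_size
        # The per-group partial sum must still be accumulated at high
        # precision (int32 here) even though the weights are only ternary.
        acc = np.int32(0)
        for a, w in zip(activations_int8[lo:hi], weights_ternary[lo:hi]):
            acc += np.int32(a) * np.int32(w)  # effectively add/subtract/skip
        # Every group then pays a higher-precision scale; smaller groups
        # mean more of these multiplies per output element.
        total += float(acc) * scale
    return total
```

Whether the cheap add/subtract inner loop outweighs the extra accumulators and per-group scaling, in latency and in chip area, is exactly the open hardware question above.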
Results may vary :)