GPU Embedding with GGML (bloop.ai)
123 points by louiskw on Oct 16, 2023 | 33 comments



Interesting... I delivered a cross-platform library including MiniLM V2 last week. I'm seeing 77 embeddings/sec versus the author's 20, but I'm also running on the CPU instead of the GPU, and on an M2 Max MacBook Pro with 64 GB of RAM vs. what I think is an M2 Mac Mini with 8 GB of RAM. (benchmarks: https://github.com/Telosnex/fonnx#benchmarks)
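
For anyone wanting to reproduce an embeddings/sec figure, here's a rough sketch of how one might be measured. This uses sentence-transformers on CPU and is purely illustrative -- it's neither fonnx nor the author's GGML path, and batch size, sequence length and thread count will swing the number a lot:

    import time
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")      # the 384-dim MiniLM discussed in the article
    sentences = ["some representative chunk of code or text"] * 512

    start = time.perf_counter()
    model.encode(sentences, batch_size=1)                # batch_size=1 mimics embedding one item at a time
    elapsed = time.perf_counter() - start
    print(f"{len(sentences) / elapsed:.1f} embeddings/sec")

Single figures like 77 or 20 only mean much when they're measured with the same settings.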

I really appreciate what GGML does but I have this irrational feeling that it'd be...unpredictable / not suited for production IMHO. (see comments here https://news.ycombinator.com/item?id=37900083 and here https://news.ycombinator.com/item?id=37898979)

I should reevaluate; this is some deep work, and I assume you didn't see anything that scared you off, other than being locked to Macs for now.


I'll plug the project I've started work on along the same lines - a simplified CPU-focused embedding model (right now with DistilBERT) that's coded as a single file with no dependencies and no abstraction. https://github.com/rbitr/ferrite


Pretty interesting, and rare to see projects in this space written in Fortran. I'd be interested to hear what the advantages are, and would appreciate hearing even if it's simply a case of personal preference!


I'm mostly a Python programmer, but I find a lot of the ML frameworks are overkill for what they actually do, especially for inference. Fortran is pretty close to numpy - it handles arrays natively, including slicing and matmul intrinsics, and you don't have to worry about memory etc. But it compiles into something fast and lightweight much more easily than Python. It's nothing you couldn't do in C, but I think Fortran is better suited for linear algebra.

See also https://github.com/rbitr/llama2.f90 which is basically the same thing but for running llama models and has 16-bit and 4-bit options and a lot more optimization.
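
For anyone who hasn't touched modern Fortran, the numpy analogy is fairly literal. A rough sketch of what "native arrays, slicing and matmul intrinsics" maps to, written in numpy for comparison (illustrative only, not code from ferrite or llama2.f90):

    import numpy as np

    x = np.random.rand(8, 64).astype(np.float32)   # Fortran: a real(4) array x(8, 64)
    w = np.random.rand(64, 64).astype(np.float32)

    h = x @ w                # Fortran intrinsic: h = matmul(x, w)
    first_half = h[:, :32]   # Fortran slicing: h(:, 1:32), built into the language
    print(first_half.shape)

The point is that these are language features that compile straight to fast native code, with no framework underneath.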


Is there an easy-to-read guide somewhere that maps all these terms and models, and when to use what and which format? GGUF, GGML, CPU vs GPU vs Llama vs quantized models. It's really confusing to try to figure out which model and which format to use based on your hardware, or 4-bit vs 8-bit.


I don't think such a guide exists; this space is moving pretty fast. A short rundown (with a GGUF loading sketch after the list):

quantized model formats:

- GGML: the older file format used with llama.cpp; outdated, support is dropped or will be soon. CPU+GPU inference

- GGUF: "new version" of the GGML file format, used with llama.cpp. CPU+GPU inference. Offers 2-8 bit quantization

- GPTQ: pure GPU inference, used with AutoGPTQ, exllama, exllamav2. Offers only 4-bit quantization

- EXL2: pure GPU inference, used with exllamav2. Offers 2-8 bit quantization

Here[1] is a nice overview of VRAM usage vs. perplexity at different quant levels (using a 70B model in EXL2 format as the example).

[1] https://old.reddit.com/r/LocalLLaMA/comments/178tzps/updated...
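
To make the GGUF row concrete, here's roughly what loading a quantized model with GPU offload looks like via llama-cpp-python. The filename and layer count are placeholders; pick a .gguf quant that fits your VRAM:

    from llama_cpp import Llama

    # hypothetical Q4_K_M quant of a 7B model (e.g. one of TheBloke's GGUF uploads)
    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=35,   # transformer layers to offload to the GPU; 0 = pure CPU
        n_ctx=2048,
    )

    out = llm("Q: What is GGUF? A:", max_tokens=64)
    print(out["choices"][0]["text"])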


Worth clarifying that GGML the library is very much active. GGML as a file format will be superseded by GGUF.


Is everything (for the most part) a Llama model? Does everything fork Llama? Is GGML part of Llama? What is the relation between Llama and model formats? Is there an analogy - is GGML to Llama as React is to JavaScript? What is the difference between GPT4All models vs llama.cpp vs Ollama?

Thanks!


Everything (most LLMs and modern embedding models) is a transformer, so the architectures are very similar. Llama(2) is a Meta (Facebook) developed transformer plus the training they did on it.

GGML is a "framework" like PyTorch etc. (for the purposes of this discussion) that lets you code up the architecture of a model, load in the weights that were trained, and run inference with it. llama.cpp is a project that I'd describe as using GGML to implement some specific AI model architectures.
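
A toy way to picture that split: the framework gives you tensor ops, the model code wires them into an architecture, and the weights come from a file. Purely conceptual numpy sketch (ggml does the equivalent in C, with its own compute graph and quantized tensor types):

    import numpy as np

    def tiny_block(x, weights):
        # "architecture": one feed-forward block, the kind of thing a ggml graph describes
        h = np.maximum(0.0, x @ weights["w1"] + weights["b1"])   # linear + ReLU
        return h @ weights["w2"] + weights["b2"]

    # "load in the weights that were trained" -- random stand-ins here;
    # in ggml these come from a GGML/GGUF file, possibly quantized
    weights = {
        "w1": np.random.randn(64, 256).astype(np.float32),
        "b1": np.zeros(256, dtype=np.float32),
        "w2": np.random.randn(256, 64).astype(np.float32),
        "b2": np.zeros(64, dtype=np.float32),
    }

    x = np.random.randn(1, 64).astype(np.float32)
    print(tiny_block(x, weights).shape)   # "run inference": (1, 64)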


I am only dabbling in this space myself, so I can't answer everything. All the formats I mentioned are for a quantized version of the original model: basically a lower-resolution version, with the associated precision loss. E.g. the original model weights are in f16, the GPTQ version is in int4 - a big difference in size but often an acceptable loss of quality. Using quants is basically a tradeoff between quality and "can I run it?".

Examples of original models are Llama(2), Mistral, Xwin. They are not directly related to any quantized versions; quants are mostly done by third parties (e.g. TheBloke[1]).

Using a full model for inference requires pretty beefy hardware; most inference on consumer hardware is done with quantized versions for that reason.

[1] https://huggingface.co/TheBloke
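
The precision loss is easy to see with a tiny sketch of symmetric 4-bit quantization. This is a simplified illustration only -- GPTQ and GGUF's k-quants quantize block-wise with smarter scale/zero-point handling:

    import numpy as np

    w = np.random.randn(8).astype(np.float32)                 # stand-in for original f16/f32 weights
    scale = np.abs(w).max() / 7                               # int4 codes span roughly -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # the "4-bit" version (packed two-per-byte in real formats)
    w_hat = q.astype(np.float32) * scale                      # what inference actually multiplies with

    print(np.abs(w - w_hat).max())                            # small but nonzero error = the quality loss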


GGML is a framework for running deep neural networks, mostly for inference. It's at the same level as PyTorch or TensorFlow. So I would say GGML is the browser in your JavaScript/React analogy.

llama.cpp is a project that uses GGML the framework under the hood, same authors. Some features were even developed in llama.cpp before being ported to GGML. Ollama provides a user-friendly way to use Llama models. No idea what it uses under the hood.


The Llama name is pretty confusing at this point.

LLaMA was the model Facebook released under a non-commercial license back in February which was the first really capable openly available model. It drove a huge wave of research, and various projects were named after it (llama.cpp for example).

Llama 2 came out in July and allowed commercial usage.

But... there are an increasing number of models now that aren't actually related to Llama at all. Projects like llama.cpp and Ollama can often be used to run those too.

So "Llama" no longer reliably means "related to Facebook's LLaMA architecture".


- GPTQ: pure GPU inference, used with AutoGPTQ, exllama, exllamav2. Offers only 4-bit quantization

What are AutoGPTQ and exllama, and what does it mean that it only works with AutoGPTQ and exllama? Are those frameworks like TensorFlow?


Ollama seems to be using a lot of the same, but as a really nice and easy-to-use wrapper around a lot of glue that many of us would wind up writing anyway. It's quickly become my personal preference.

It appears to include submodules for GGML and GGUF from llama.cpp:

https://github.com/jmorganca/ollama/tree/main/llm
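
For a sense of the glue it wraps: once the Ollama daemon is running (and you've pulled a model with `ollama pull llama2`), generation is just a local HTTP call. Rough sketch, assuming the default localhost:11434 /api/generate endpoint:

    import json
    import requests

    # Ollama's local server listens on port 11434 by default
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": "Why is the sky blue?"},
        stream=True,
    )
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)              # Ollama streams one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)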


The model discussed in the article is MiniLM-L6-v2, which you can run via PyTorch from the sentence-transformers project[1].

That model is based on BERT, not LLaMA [2].

[1]: https://www.sbert.net/docs/pretrained_models.html

[2]: https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
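
For completeness, running that exact model via sentence-transformers is only a few lines (CPU or CUDA, no GGML involved):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")    # the 6-layer, 384-dim MiniLM from [1]
    embeddings = model.encode([
        "def hello(): print('hi')",
        "a function that prints a greeting",
    ])
    print(embeddings.shape)                            # (2, 384)
    print(util.cos_sim(embeddings[0], embeddings[1]))  # semantic similarity of the two snippets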


I think you're still missing AWQ, which is a sort of GPTQ but with dynamic quantization depending on weight importance, IIRC?


GGML has been replaced with GGUF now and GGML is no longer getting any updates.

GPU offloading for GGUF/GGML has been available for quite a long time in Text Generation WebUI and works very well, but isn’t nearly as fast as GPTQ or the new AWQ format.


GGUF is the file format, but GGML is still the framework


Nice post. Also somewhat surprised Core ML doesn't have an implementation of GeLU since it's so widely used at this point (and not particularly challenging to re-implement, were I the speculative type..)
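
For reference, both the exact GeLU and the usual tanh approximation are one-liners; a numpy sketch:

    import numpy as np
    from scipy.special import erf

    def gelu_exact(x):
        # GeLU(x) = x * Phi(x), where Phi is the standard normal CDF
        return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

    def gelu_tanh(x):
        # the tanh approximation most transformer codebases ship
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    x = np.linspace(-4, 4, 17)
    print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))   # approximation error is tiny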


CoreML has had support[1] for GeLU since the iOS 15 days.

[1]: https://apple.github.io/coremltools/source/coremltools.conve...


ty


It's ONNX Runtime that's lacking support, per TFA, not CoreML.


My mistake, thanks


Isn't GGML supposed to run on the CPU? It seems that this can run on any GPU, not necessarily just NVIDIA.


CPU is still the first-class option, but GGML also supports Metal, cuBLAS and CLBlast for hardware acceleration.


Would be great to support more GPUs; Metal is the only graphics API we support atm, with CPU as a fallback.


cuBLAS is CUDA, CLBlast is OpenCL.


I guess this is another reason to comment code descriptively, so if your LLM is searching it has a better chance of seeing what something is doing and where it is.


It certainly helps, but one of the goals is to help you understand large badly documented/commented codebases




GGML is the library, GGUF is the new GGML model format.


This is correct, GGML the library isn't going anywhere


This article is about embeddings, so are you indicating this can be done from Code Llama 13B?


That's amazing! We have been doing very similar optimisation for clientvectorsearch.com

Great work!



