Looks like Nolano.org's "cformers" includes a fork of llama.cpp/ggml by HCBlackFox that supports the GPT-NeoX architecture that powers EleutherAI's Pythia family of open LLMs (which also powers Databricks' new Dolly 2.0), as well as StabilityAI's new StableLM.
I quantized the weights to 4-bit and uploaded them to HuggingFace: https://huggingface.co/cakewalk/ggml-q4_0-stablelm-tuned-alp...

Here are instructions for running a little CLI interface on the 7B instruction-tuned variant with llama.cpp-style quantized CPU inference:
    pip install transformers wget
    git clone https://github.com/antimatter15/cformers.git
    cd cformers/cformers/cpp && make && cd ..
    python chat.py -m stability
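If you'd rather drive it from Python than through chat.py, cformers also ships a small AutoInference wrapper. Here's a minimal sketch, run from the cformers/cformers directory, based on the interface shown in its README; the exact model identifier for StableLM and the prompt format tokens are assumptions on my part:

    # Minimal sketch of cformers' AutoInference wrapper (per its README);
    # the model identifier for StableLM here is an assumption.
    from interface import AutoInference as AI

    ai = AI("stabilityai/stablelm-tuned-alpha-7b")  # assumed model id
    out = ai.generate(
        "<|USER|>What is 2 + 2?<|ASSISTANT|>",  # tuned-variant prompt format
        num_tokens_to_generate=64,
    )
    print(out["token_str"])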
That said, I'm getting pretty poor performance out of the instruction-tuned variant of this model. Even without quantization, just running their official Quickstart, it doesn't give a particularly coherent answer to "What is 2 + 2":
> This is a basic arithmetic operation that is 2 times the result of 2 plus the result of one plus the result of 2. In other words, 2 + 2 is equal to 2 + (2 x 2) + 1 + (2 x 1).
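For reference, the unquantized run looks roughly like this. A minimal sketch along the lines of StabilityAI's Quickstart, assuming the stabilityai/stablelm-tuned-alpha-7b checkpoint on HuggingFace; the generation parameters are my assumptions, not their exact settings:

    # Minimal sketch of an unquantized transformers run, roughly following
    # StabilityAI's Quickstart; the checkpoint id and generation settings
    # are assumptions, not their exact values.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "stabilityai/stablelm-tuned-alpha-7b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

    # The tuned checkpoints are trained on a <|SYSTEM|>/<|USER|>/<|ASSISTANT|>
    # prompt format, so the question gets wrapped accordingly.
    prompt = "<|USER|>What is 2 + 2?<|ASSISTANT|>"
    inputs = tokenizer(prompt, return_tensors="pt")
    tokens = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
    print(tokenizer.decode(tokens[0], skip_special_tokens=True))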