Looks like Nolano.org's "cformers" includes a fork of llama.cpp/ggml by HCBlackFox that supports the GPT-NeoX architecture that powers EleutherAI's Pythia family of open LLMs (which also powers Databricks' new Dolly 2.0), as well as StabilityAI's new StableLM.
I quantized the weights to 4-bit and uploaded them to HuggingFace: https://huggingface.co/cakewalk/ggml-q4_0-stablelm-tuned-alp...

Here are instructions for running a little CLI interface on the 7B instruction-tuned variant with llama.cpp-style quantized CPU inference:
    pip install transformers wget
    git clone https://github.com/antimatter15/cformers.git
    cd cformers/cformers/cpp && make && cd ..
    python chat.py -m stability
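If you'd rather drive it from Python than through chat.py, cformers also ships a small AutoInference wrapper. Here's a minimal sketch, run from the cformers/cformers directory, based on the interface shown in its README; the exact model identifier for StableLM and the prompt format tokens are assumptions on my part:

    # Minimal sketch of cformers' AutoInference wrapper (per its README);
    # the model identifier for StableLM here is an assumption.
    from interface import AutoInference as AI

    ai = AI("stabilityai/stablelm-tuned-alpha-7b")  # assumed model id
    out = ai.generate(
        "<|USER|>What is 2 + 2?<|ASSISTANT|>",  # tuned-variant prompt format
        num_tokens_to_generate=64,
    )
    print(out["token_str"])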
That said, I'm getting pretty poor performance out of the instruction-tuned variant of this model. Even without quantization, just running their official Quickstart, it doesn't give a particularly coherent answer to "What is 2 + 2":
> This is a basic arithmetic operation that is 2 times the result of 2 plus the result of one plus the result of 2. In other words, 2 + 2 is equal to 2 + (2 x 2) + 1 + (2 x 1).
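For reference, the unquantized run looks roughly like this. A minimal sketch along the lines of StabilityAI's Quickstart, assuming the stabilityai/stablelm-tuned-alpha-7b checkpoint on HuggingFace; the generation parameters are my assumptions, not their exact settings:

    # Minimal sketch of an unquantized transformers run, roughly following
    # StabilityAI's Quickstart; the checkpoint id and generation settings
    # are assumptions, not their exact values.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "stabilityai/stablelm-tuned-alpha-7b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

    # The tuned checkpoints are trained on a <|SYSTEM|>/<|USER|>/<|ASSISTANT|>
    # prompt format, so the question gets wrapped accordingly.
    prompt = "<|USER|>What is 2 + 2?<|ASSISTANT|>"
    inputs = tokenizer(prompt, return_tensors="pt")
    tokens = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
    print(tokenizer.decode(tokens[0], skip_special_tokens=True))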