Looks like Nolano.org's "cformers" includes a fork of llama.cpp/ggml by HCBlackFox that supports the GPT-NeoX architecture behind EleutherAI's Pythia family of open LLMs (and Databricks' new Dolly 2.0), as well as StabilityAI's new StableLM.

I quantized the weights to 4-bit and uploaded them to HuggingFace: https://huggingface.co/cakewalk/ggml-q4_0-stablelm-tuned-alp...
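For the curious, q4_0 is a very simple scheme: the weights are split into blocks of 32, and each block stores one float scale plus 32 signed 4-bit integers. Here's a rough numpy sketch of the idea (names are mine; the real format additionally packs two 4-bit values per byte and stores the scale as fp16):

    import numpy as np

    QK = 32  # ggml q4_0 block size

    def quantize_q4_0(x):
        # Split into blocks of 32 and compute a per-block scale
        # from the largest absolute value, so values map to [-7, 7].
        blocks = x.reshape(-1, QK)
        amax = np.abs(blocks).max(axis=1, keepdims=True)
        d = amax / 7.0
        d[d == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
        q = np.clip(np.round(blocks / d), -7, 7).astype(np.int8)
        return d.astype(np.float32), q

    def dequantize_q4_0(d, q):
        # Reconstruct approximate weights: one multiply per value.
        return (d * q).astype(np.float32).reshape(-1)

    w = np.random.randn(64).astype(np.float32)
    d, q = quantize_q4_0(w)
    print("max abs error:", np.abs(w - dequantize_q4_0(d, q)).max())

So you pay 4 bits per weight plus a shared scale per 32 weights, and dequantization is cheap enough to do on the fly during matrix multiplication on the CPU.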

Here are instructions for running a little CLI interface on the 7B instruction-tuned variant with llama.cpp-style quantized CPU inference.

    pip install transformers wget                # Python deps used by chat.py
    git clone https://github.com/antimatter15/cformers.git
    cd cformers/cformers/cpp && make && cd ..    # build the C++ inference code
    python chat.py -m stability                  # fetch the model and start chatting
That said, I'm getting pretty poor performance out of the instruction-tuned variant of this model. Even without quantization, just running their official Quickstart, it doesn't give a particularly coherent answer to "What is 2 + 2":

    This is a basic arithmetic operation that is 2 times the result of 2 plus the result of one plus the result of 2. In other words, 2 + 2 is equal to 2 + (2 x 2) + 1 + (2 x 1).
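For reference, here's roughly what that Quickstart boils down to (a minimal sketch assuming the stabilityai/stablelm-tuned-alpha-7b checkpoint on HuggingFace and its <|SYSTEM|>/<|USER|>/<|ASSISTANT|> prompt format; the sampling settings here are my own, and the official version also adds stopping criteria on StableLM's special stop tokens):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "stabilityai/stablelm-tuned-alpha-7b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

    # The tuned models are trained on this special-token chat format.
    prompt = "<|SYSTEM|>You are a helpful assistant.<|USER|>What is 2 + 2?<|ASSISTANT|>"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    tokens = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True)

    # Print only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(tokens[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

So the incoherence doesn't appear to be a quantization artifact; the fp16 model gives similarly odd answers.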



llama.cpp has preliminary support already. https://github.com/ggerganov/llama.cpp/issues/1063#issuecomm...



