
In fairness, it’s become more difficult than ever. Too many variables affect the numbers:

* hardware spec

* inference engine

* specific model - tokenizer differences can make models faster or slower even at equivalent parameter counts

* quantization used - and you need to be aware of hardware-specific optimizations for particular quants

* kv cache settings

* input context size

* output token count

This is probably not a complete list either.
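
The last two items alone can swing a headline tokens/sec figure dramatically, because prefill (processing the input context) and decode (generating output) run at very different rates on the same hardware, engine, model, and quant. A minimal sketch of the blending effect, where the prefill/decode rates are made-up illustrative numbers rather than measurements of any particular setup:

    # Why prompt/output lengths skew tokens/sec comparisons.
    # prefill_tps and decode_tps are illustrative assumptions,
    # not measurements of any real hardware or model.

    def blended_tokens_per_sec(prompt_tokens: int, output_tokens: int,
                               prefill_tps: float, decode_tps: float) -> float:
        """End-to-end throughput: total tokens over total wall time.

        Prefill is usually compute-bound and fast; decode is usually
        memory-bandwidth-bound and much slower, so the blended number
        depends heavily on the input/output token split.
        """
        prefill_time = prompt_tokens / prefill_tps
        decode_time = output_tokens / decode_tps
        return (prompt_tokens + output_tokens) / (prefill_time + decode_time)

    # Same hypothetical machine and model, two different workloads:
    short_prompt = blended_tokens_per_sec(128, 512, prefill_tps=2000, decode_tps=30)
    long_prompt = blended_tokens_per_sec(8192, 512, prefill_tps=2000, decode_tps=30)

    print(f"short prompt: {short_prompt:7.1f} tok/s")  # ~37 tok/s, dominated by slow decode
    print(f"long prompt:  {long_prompt:7.1f} tok/s")   # ~411 tok/s, dominated by fast prefill

So any fair comparison needs to report prompt and output token counts (and ideally prefill and decode rates separately) alongside the headline number.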


