
Does anybody know of an inference provider that offers input token caching? It should be almost a requirement for agentic use - first for speed, but also because almost every conversation starts where the previous one ended, so costs can end up quite a bit higher without caching.

I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except by running vLLM myself on self-hosted instances.
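
For the self-hosted route, here's a minimal sketch of what enabling vLLM's automatic prefix caching looks like - the model name and prompts are just placeholders, and newer vLLM versions may enable this by default:

    # Sketch: self-hosted vLLM with automatic prefix caching, so the shared
    # conversation prefix typical of agentic loops is served from the KV cache
    # instead of being recomputed each turn.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder; any supported model
        enable_prefix_caching=True,        # reuse KV cache across shared prefixes
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # Each turn resends the growing conversation; only the new suffix is prefilled.
    history = "System: You are a helpful agent.\nUser: Summarize the repo.\n"
    print(llm.generate([history], params)[0].outputs[0].text)

    history += "Assistant: ...\nUser: Now draft a fix for the failing test.\n"
    print(llm.generate([history], params)[0].outputs[0].text)

The equivalent for the OpenAI-compatible server is the --enable-prefix-caching flag on vllm serve.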



Alibaba Cloud does: > Supported models. Currently, qwen-max, qwen-plus, qwen-turbo, qwen3-coder-plus support context cache.


I know. I can't believe LM Studio, Ollama, and especially model providers don't offer this yet.



