
Does anybody know of an inference provider that offers input token caching? It should be almost a requirement for agentic use - first for speed, but also because almost every conversation starts where the previous one ended, so costs can end up quite a bit higher without caching.

I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except by running vLLM myself on self-hosted instances.
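
For the self-hosted route, here's a minimal sketch of what enabling vLLM's automatic prefix caching looks like - the model name and prompts are just placeholders, and newer vLLM versions may enable this by default:

    # Sketch: self-hosted vLLM with automatic prefix caching, so the shared
    # conversation prefix typical of agentic loops is served from the KV cache
    # instead of being recomputed each turn.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder; any supported model
        enable_prefix_caching=True,        # reuse KV cache across shared prefixes
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # Each turn resends the growing conversation; only the new suffix is prefilled.
    history = "System: You are a helpful agent.\nUser: Summarize the repo.\n"
    print(llm.generate([history], params)[0].outputs[0].text)

    history += "Assistant: ...\nUser: Now draft a fix for the failing test.\n"
    print(llm.generate([history], params)[0].outputs[0].text)

The equivalent for the OpenAI-compatible server is the --enable-prefix-caching flag on vllm serve.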



Alibaba Cloud does: > Supported models. Currently, qwen-max, qwen-plus, qwen-turbo, qwen3-coder-plus support context cache.


I know. I can't believe LM Studio, Ollama, and especially model providers don't offer this yet.



