You can pay for Claude API access (as opposed to the regular Claude Pro subscription) and wire in something like Cline via your API key, but it gets expensive fast in my experience.
The company I work for, and actually most Swiss IT contractors, have strict rules: on more than half of our projects we aren't allowed to use GitHub Copilot or paste anything into an external LLM API.
That's why I built a vLLM-based local GPU machine for our dev squads as a trial. It currently runs a single 4070 Ti Super with 16GB of VRAM, and we're upgrading to 4x 4070 Ti Super to support 70B models.
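For anyone curious, here's a minimal sketch of what the vLLM side looks like on a single card. The quantized model repo and the memory settings below are illustrative assumptions, not our exact config; in practice you'd run vLLM's OpenAI-compatible server for the squads rather than the offline API.

```python
# Minimal vLLM sketch: load a quantized 7B coder model on one 16GB card and
# run a single generation as a sanity check. Repo name, context cap and
# memory utilization are assumptions to tune for your own hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-7B-Instruct-AWQ",  # assumed quantized repo
    max_model_len=16384,           # cap context so the KV cache fits in VRAM
    gpu_memory_utilization=0.90,   # leave a little headroom for the driver
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["Write a Python function that parses an ISO 8601 date."], params)
print(out[0].outputs[0].text)
```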
The difficulties we face IMHO:
- Cursor doesn't support WSL Devcontainers
- Small tab-completion models matter just as much as the big chat models, but there's far less development happening around them
- There's a huge gap between the 7-14B and 120B ranges; very few good 70B models are available
In practice, in the 7-14B range nothing beats Qwen2.5 for interactive coding, and something around 2B works best for tab-completion.
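For the tab-completion side, this is roughly what fill-in-the-middle (FIM) prompting looks like against any OpenAI-compatible endpoint (vLLM or llama.cpp's server). The endpoint URL, model name and stop token are placeholders, and FIM is usually done against the base coder models rather than the instruct variants.

```python
# Rough FIM sketch for editor-style tab-completion, assuming a small
# Qwen2.5-Coder base model is already served behind an OpenAI-compatible
# /v1/completions endpoint. All names below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

prefix = "def read_config(path):\n    with open(path) as f:\n        "
suffix = "\n    return config\n"

# Qwen2.5-Coder's FIM format: give it the code before and after the cursor,
# and it generates the missing middle.
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = client.completions.create(
    model="Qwen/Qwen2.5-Coder-1.5B",  # a ~2B-class base model for completion
    prompt=prompt,
    max_tokens=64,
    temperature=0.0,
    stop=["<|endoftext|>"],
)
print(resp.choices[0].text)
```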
Question for those using it: can the 7B really be run locally on a card with only 16GB of VRAM? LLM Explorer says [1] it requires 15.4GB, which seems like cutting it close.
I am happily using qwen2.5-coder-7b-instruct-q3_k_m.gguf with a context size of 32768 on an RTX 3060 Mobile with 6GB VRAM using llama.cpp [2]. The 15.4GB figure is for the unquantized fp16 weights; the quantized GGUF variants are much smaller. With 16GB VRAM you could use qwen2.5-7b-instruct-q8_0.gguf, which is basically indistinguishable from the fp16 variant.
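To make that concrete, here's a minimal llama-cpp-python sketch of the same setup. The sizes in the comments are ballpark figures, and the offload setting is something you'd tune down if you run out of VRAM on a 6GB card.

```python
# Minimal llama-cpp-python sketch of the setup described above.
# Rough 7B weight sizes: ~15 GB at fp16, ~8 GB at Q8_0, ~4 GB at Q3_K_M,
# plus the KV cache, which grows with the context length.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-coder-7b-instruct-q3_k_m.gguf",
    n_ctx=32768,       # context size from the comment above
    n_gpu_layers=-1,   # offload all layers; reduce this if VRAM runs out
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a bash one-liner to count lines of Python code."}],
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```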