I tried the new qwen model in Codex CLI and in Roo Code and I found it to be pretty bad. For instance I told it I wanted a new vite app and it just started writing all the files from scratch (which didn’t work) rather than using the vite CLI tool.
Is there a better agentic coding harness people are using for these models? Based on my experience I can definitely believe the claims that these models are overfit to Evals and not broadly capable.
I've noticed that open-weight models tend to hesitate to use tools or commands unless those appeared often in their training data, or you tell them very explicitly to do so in your AGENTS.md or prompt.
They also struggle at translating very broad requirements to a set of steps that I find acceptable. Planning helps a lot.
Regarding the harness, I have no idea how much they differ but I seem to have more luck with https://pi.dev than OpenCode. I think the minimalism of Pi meshes better with the limited capabilities of open models.
+1 to this, anecdotally I’ve found in my own evaluations that if your system prompt doesn’t explicitly declare how to invoke a tool and e.g. describe what each tool does, most models I’ve tried fail to call tools or will try to call them but not necessarily use the right format. With the right prompt meanwhile, even weak models shoot up in eval accuracy.
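To illustrate what I mean by "explicitly declare how to invoke a tool", here's a rough sketch using the OpenAI-style function-calling schema. The tool name `run_shell` and all the wording are made up for illustration, not from any particular harness:

```python
# Hypothetical tool declaration in the OpenAI-style function-calling format.
# In my experience, weaker models do markedly better when the description
# spells out both WHEN to call the tool and WHAT the arguments look like.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": (
            "Execute a shell command in the project root and return stdout. "
            "Prefer this for scaffolding (e.g. `npm create vite@latest`) "
            "instead of writing generated files by hand."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The exact shell command to run.",
                },
            },
            "required": ["command"],
        },
    },
}]

# Restating the calling convention in the system prompt as well seems to
# help models that otherwise emit malformed tool calls.
system_prompt = (
    "You have one tool, run_shell. Call it with a JSON object like "
    '{"command": "ls -la"}. Use existing CLI scaffolding tools rather '
    "than writing boilerplate files yourself."
)
```

Redundant, sure, but the belt-and-suspenders approach (schema plus a prose restatement in the system prompt) is what moved the needle most in my evals.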
Have a frontier model do the plan, which is the most time-consuming part anyway, and then have a local LLM do the implementation.
The frontier model can orchestrate your tickets, write a plan for each, and dispatch local LLM agents to implement at about 180 tokens/s; vLLM can probably manage something like 25 concurrent sessions on an RTX 6000.
Do it all in worktrees and then have the frontier model do the review and merge.
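The worktree-per-ticket part of this flow can be sketched roughly like so. This is just an illustration of the shape, assuming plain `git worktree`; `dispatch_ticket` and the `PLAN.md` convention are placeholders I made up, not any real harness's API:

```python
# Sketch: one git worktree per ticket, so concurrent local agents
# never step on each other's working copies.
import subprocess
from pathlib import Path

def dispatch_ticket(repo: Path, ticket_id: int, plan: str) -> Path:
    """Create an isolated worktree + branch for one ticket and drop the
    frontier model's plan into it for the local agent to pick up."""
    branch = f"ticket-{ticket_id}"
    worktree = repo.parent / f"wt-{ticket_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    # The implementing agent reads this file as its spec (a convention
    # I'm assuming here, not a standard).
    (worktree / "PLAN.md").write_text(plan)
    return worktree
```

Once a ticket's agent finishes, the reviewer looks at the branch and the frontier model merges or bounces it; `git worktree remove` cleans up.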
I am just a retired hobbyist, but that's my approach. I run everything through Gitea issues; each issue gets launched by the orchestrator in a new tmux window, and the two main agents (implementer and reviewer) get their own panes so I can see what's going on. I think Claude Code has now somewhat streamlined this too, but I've seen no need to change my approach yet since I'm just tinkering on my personal projects. Right now I use Claude Code subagents, but I've been thinking of replacing them with some of these Qwen 3.5 models, because they do seem capable and I have the hardware to run them.
In my experience Qwen3.5/Qwen3-Coder-Next perform best in their own harness, Qwen-Code. You can also crib the system prompt and tool definitions from there. One caveat though: even though the Qwen models are state of the art for local models, they are about a year behind anything you can pay for commercially, so asking one to build a new app from scratch might be a bit much.
It was originally the eternally-on-the-horizon Semantic Web, before somebody decided to repurpose the name for something to do with crypto (perhaps without bothering to search for "web 3" beforehand).
Thinking about what Jony Ive said about "owning the unintended consequence" of making screens ubiquitous, and how a voice-controlled, completely integrated service could be that new computing paradigm Sam was talking about when he said: "You don't get a new computing paradigm very often. There have been like only two in the last 50 years. … Let yourself be happy and surprised. It really is worth the wait."
I suspect we’ll see stronger voice support, and deeper app integrations in the future. This is OpenAI dipping their toe in the water of the integrations part of the future Sam and Jony are imagining.
And remember Quibi [1]? Short-form video in vertical format specifically for mobile devices? They didn't have every aspect nailed, but they were definitely trailblazers on that front.
Quibi launched in April 2020. TikTok by that point had around 2 billion downloads [1]. It's difficult to argue they were trailblazers here. I might even say a component of their failure was that free mobile video was already widely accessible by then.
Didn't have every aspect nailed? Definitely trailblazers? Quibi is a prime example of an absolute business wipeout. They got a bunch of investor money together, showed no interest in what viewers actually want, and then went down in flames immediately upon public release of the product. The whole thing was a disaster that didn't accomplish anything beyond putting a bunch of capital in the pockets of C grade C suite players.
I thought they kept incentivizing longer content so they could cram more ads into the videos. It's hard to get someone to watch a 20-second ad for a 2-minute video, but if you can convince everyone to pad that thing up to 10 minutes, you can stuff at least 2 ads in there.