Agreed that it's not the best way to use agents right now (they still need supervision), but I think in the coming year(s) we'll reach a point where they'll be good enough to run on their own (see Codex).
We (an open community of AI researchers at Stanford, Anthropic, UW, and more) just released Terminal-Bench, a new open-source framework for evaluating how well AI agents perform in terminal environments. Given how much we all use the terminal and how many new AI terminal assistants are emerging, we wanted to create a rigorous way to test their capabilities.
What we found: The best commercial agents (using models like GPT-4, Claude, Gemini) score less than 20% on our benchmark tasks. Even with their impressive capabilities, these agents struggle with:
- Chaining multiple terminal commands together (see the sketch after this list)
- Reasoning over long command outputs
- Acting independently within sensible limits
- Executing tasks safely
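To give a flavor of what that chaining looks like in practice, here is a purely hypothetical illustration (not taken from the benchmark): the agent has to build a command pipeline, run it, and then reason over the raw text output. The paths and the pipeline are illustrative assumptions only:

```python
import subprocess

# Hypothetical illustration (not an actual Terminal-Bench task): list the five
# largest *.log files under /var/log. The agent has to chain several commands
# and then reason over the raw text output.
pipeline = (
    "find /var/log -type f -name '*.log' -printf '%s %p\\n' "
    "| sort -rn | head -5"
)
result = subprocess.run(["bash", "-c", pipeline], capture_output=True, text=True)

# The agent only sees stdout as text, e.g. "1048576 /var/log/syslog",
# and has to parse it correctly to answer any follow-up question.
for line in result.stdout.splitlines():
    size, path = line.split(maxsplit=1)
    print(f"{path}: {int(size) / 1024:.1f} KiB")
```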
What's in Terminal-Bench:
- Docker-containerized environments for consistent testing
- Hand-crafted tasks covering data science, networking, security, and more (a rough sketch of a task follows this list)
- Human-verified solutions and test cases
- Support for different integration methods
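To make the task format concrete, here's a rough sketch of what one of these hand-crafted tasks might look like. The file layout, paths, and the example task ("archive every CSV under /data") are assumptions for illustration only, not the actual Terminal-Bench spec; see the repo for the real format. The idea is that a human-written check runs against the container after the agent has finished:

```python
# Hypothetical tests/test_outputs.py for an illustrative task: "archive every
# .csv file under /data into /data/csv_backup.tar.gz". File names and layout
# are assumptions for illustration, not the actual Terminal-Bench format.
import pathlib
import tarfile

ARCHIVE = pathlib.Path("/data/csv_backup.tar.gz")


def test_archive_exists():
    # The agent is expected to have created the archive inside the container.
    assert ARCHIVE.is_file()


def test_archive_contains_every_csv():
    expected = {p.name for p in pathlib.Path("/data").glob("*.csv")}
    with tarfile.open(ARCHIVE) as tar:
        archived = {pathlib.Path(m.name).name for m in tar.getmembers() if m.isfile()}
    assert expected <= archived
```

Because the environment and checks like these live alongside the Docker image, anyone should be able to re-run a task and get the same verdict.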
Want to get involved? We're looking for contributors to help expand the benchmark with challenging new tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!
If you're interested in working on this, we'd love to see new contributors in the Discord: https://discord.gg/6xWPKhGDbA