> they're performing at at least graduate student level across most tasks
I strongly disagree with this characterization. I have yet to find an application that can reliably execute this prompt:
"Find 90 minutes on my calendar in the next four weeks and book a table at my favorite Thai restaurant for two, outside if available."
Forget "graduate-level work," that's stuff I actually want to engage with. What many people really need help with is just basic administrative assistance, and LLMs are way too unpredictable for those use cases.
I've found that they struggle with time and dates, and are sometimes weird about numbers. I asked Grok to estimate the likelihood of something happening, and it gave me percentages for that day, the next day, the next week, and so on. Good enough. But when I checked back the next day, it was still assigning a 5-10% chance to the day that had already passed. I had to explain to it that the percentage for yesterday should now be 0%, since it was in the past.
In another example, I asked it to turn one of its bullet-point answers into a conversational summary that I could convert into an audio file to listen to later. It kicked out something that came to about 6 minutes of audio, so I asked it to expand on the details and give me something around 20 minutes. It kicked out text that came to about 7 minutes. So I explained that the last attempt was X words and only lasted 7 minutes, meaning I needed about 3X words. It kicked out about half that while claiming it was giving me 3X words, or 20 minutes.
It's little stuff like that that makes me think that, however useful these tools are for some things, they're a long way from being able to take a handed-off task and complete it as reliably as even a fairly dim human intern. If an intern kept coming back with half the job I asked for, I'd assume he was being lazy and let him go, but these things are just dumb in certain odd ways.
This is similar to many experiences I've had with LLM tools as well; the more complex and/or multi-step the task, the less reliable they become. This is why I object to the "graduate-level" label that Sam Altman et al. use. It fundamentally misrepresents the skill pyramid that makes a researcher (or any knowledge worker) effective. If a researcher can't reliably manage a to-do list, they can't be left unsupervised with any critical tasks, despite the impressive amount of information they can bring to bear and the efficiency with which they can search the web.
That's fine; I get a lot of value out of AI tooling across ChatGPT, Cursor, Claude+MCP, and even Apple Intelligence. But I have yet to use an agent that comes anywhere close, with any consistency, to the capabilities AI optimists claim.
This is absolutely doable right now. Just hook Claude Code up to your calendar MCP server and any of the available restaurant/web-browser MCP servers, and it'll do this for you.
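For concreteness, a minimal sketch of that wiring using Claude Code's MCP registration command. The calendar server package below is a hypothetical placeholder (any MCP-compatible calendar server would do); @playwright/mcp is Microsoft's published browser-automation MCP server, though the exact invocation is worth checking against current docs:

```sh
# Sketch only: "some-calendar-mcp-server" is a hypothetical placeholder,
# not a real package; swap in whichever calendar MCP server you use.
claude mcp add calendar -- npx -y some-calendar-mcp-server

# Browser automation so the agent can actually make the reservation.
claude mcp add browser -- npx -y @playwright/mcp@latest

# Then paste the original prompt into a Claude Code session:
#   "Find 90 minutes on my calendar in the next four weeks and book a
#    table at my favorite Thai restaurant for two, outside if available."
```

Whether it completes the booking reliably end to end is, of course, the question at issue.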
How reliable are the results? I'd expect a human operating at graduate level to get this right nearly 100% of the time and to adapt to unforeseen circumstances.
That's great to hear - do you know what success rate it might have? I've used scheduled tasks in ChatGPT and they fail regularly enough to fall into the "toy" category for me. But if Operator is operating significantly above that threshold, that would be remarkable and I'd gladly eat my words.