I've found that they struggle with time and dates, and are sometimes weird about numbers. I asked Grok to estimate the likelihood of something happening, and it gave me percentages for that day, the next day, the next week, and so on. Good enough. But when I came back the next day, it was still quoting a 5-10% chance of the thing having happened the previous day. I had to explain that the percentage for yesterday should now be 0%, since it was already in the past.
In another example, I asked it to turn one of its bullet-point answers into a conversational summary that I could convert into an audio file to listen to later. It kicked out something that came to about 6 minutes of audio, so I asked if it could expand on the details and give me something closer to 20 minutes. It kicked out text that ran about 7 minutes. So I explained that its response was X words and only lasted 7 minutes, so I needed about 3X words. It kicked out about half that, but claimed it was giving me 3X words, or 20 minutes.
It's little stuff like that that makes me think that, no matter how useful these tools might be for some things, we're a long way from being able to just hand them tasks and expect them to be done as reliably as a fairly dim human intern. If an intern kept coming back with half the job I asked for, I'd assume he was being lazy and let him go, but these things are just dumb in certain odd ways.
This matches many of my own experiences with LLM tools: the more complex or multi-step the task, the less reliable they become. It's why I object to the "graduate-level" label that Sam Altman et al. use. It fundamentally misrepresents the skill pyramid that makes a researcher (or any knowledge worker) effective. A researcher who can't reliably manage a to-do list can't be left unsupervised with any critical task, no matter how much information they can bring to bear or how efficiently they can search the web.
That's fine; I get a lot of value out of AI tooling between ChatGPT, Cursor, Claude+MCP, and even Apple Intelligence. But I have yet to use an agent that consistently comes close to the capabilities AI optimists claim.