
> But they mentioned there is still a lot of data in other modalities such as audio, video. Probably the reason more focus on multimodal models and also synthetic datasets.

I think this is really interesting, but I wonder whether there really is enough data there to make a qualitative difference. I'm sure there's enough to make a better model, but I doubt the result would be more than an improved chatbot. What people are really waiting for is a qualitative shift, not just an improved GPT.

> It's a multistep process and probably agents and planning will be a next step for LLM.

I agree, we definitely need a new understanding here. With the architecture we have right now, agents just don't seem to work. In my experience, if the LLM doesn't figure it out within a few shots, retrying over and over with different tools/functions doesn't help.
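To be concrete about the failure mode I mean: the naive agent loop is "retry the task, cycling through tools, until something sticks." A minimal sketch in Python (the `call_llm` stub and tool names are hypothetical, just to show the shape of the loop):

```python
# Hypothetical sketch of the naive agent retry loop, not any real framework's API.

def call_llm(task: str, tool: str) -> str:
    # Stub standing in for a real model call: pretend the model only
    # succeeds when the task actually matches the tool it was given.
    return "success" if tool in task else "failure"

def naive_agent(task: str, tools: list[str], max_retries: int = 3):
    """Retry the same task with different tools until one 'works'."""
    for _attempt in range(max_retries):
        for tool in tools:
            if call_llm(task, tool) == "success":
                return tool
    # If the model couldn't solve it within a few shots, more retries
    # with the same prompt rarely change the outcome.
    return None

print(naive_agent("search the web", ["calculator", "search"]))   # -> search
print(naive_agent("prove a theorem", ["calculator", "search"]))  # -> None
```

The point of the sketch: the outer retry loop adds no new information, so when the first few shots fail, the later ones fail the same way.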
