I've seen similar results in physics. I suspect LLMs can point the user in the right direction when the topic has already been discussed at length on the web. When an LLM can pattern-match on whole discussions, it becomes a next-level search engine.
Next, I hope we can somehow get LLMs to distinguish between reliable and less-reliable results.