I have an idea how to test whether AI can be a good scientist:
Train on all published scientific knowledge and observations up to a certain point, just before a breakthrough occurred. Then see if your AI can generate the breakthrough on its own.
For example, prior to 1900 quantum theory did not exist. Given what we knew then, could AI reproduce the ideas of Planck, Einstein, Bohr, etc.?
If not, then AI will never be useful for generating scientific theory.
I don’t think this is the main point of the paper. They’re not claiming that AI is capable of scientific breakthroughs. Rather, they argue that AI excels at summarising vast amounts of existing scientific knowledge.
Or just have the AI generate new specific experimental setups and parameters that we can try and be like "oh yeah, we just made a room temperature superconductor".
Honestly given what we know about physics, the AI should be able to simulate physics within itself or deduce certain things we've missed.
I only gave the paper a quick read, but I couldn't find how many humans they used to generate their expected human performance. This seems to be the main content:
> To ensure that we did not overfit PaperQA2 to achieve high performance on LitQA2, we generated a new set of 101 LitQA2 questions after making most of the engineering changes to PaperQA2. The accuracy of PaperQA2 on the original set of 147 questions did not differ significantly from its accuracy on the latter set of 101 questions, indicating that our optimizations in the first stage generalized well to new and unseen LitQA2 questions (Table 2).
> To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program (see Section 8.2.1), were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 73.8% ± 9.6% (mean ± SD, n = 9) precision on LitQA2 and 67.7% ± 11.9% (mean ± SD, n = 9) accuracy (Figure 2A, green line). PaperQA2 thus achieved superhuman precision on this task (t(8.6) = 3.49, p = 0.0036) and did not differ significantly from humans in accuracy (t(8.5) = −0.42, p = 0.66).
Academic writing is notoriously hard to read and often poorly written. If this lives up to its billing it will be a game changer - no need to rely on the sporadic, manual, intrinsically limited surveys produced by everyone from academics and analysts through to gym bros and Reddit posters.
One of my big uses of LLMs has been searching through medical research. The issue has been occasionally running into confidence where it shouldn't be, but I have found it hallucinates a lot less on scientific topics than it does on more common ones.
1. Shove all potentially relevant data (e.g. an entire book or library) into an LLM (quite expensive, and the needle-in-a-haystack problem persists even in recent models IIRC -- though splitting the input into many smaller prompts seems to solve it without substantially increasing the price).
2. Vector database (in my experience overcomplicated and spotty: often not much better than an expanded keyword search, sometimes worse).
3. Web search (generate queries, run them on DuckDuckGo, read the top N results) -- decent in theory, but most top search results are crap; I need to adapt this method to use only high-quality sources instead of general-purpose search engines.
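The split-into-smaller-prompts workaround from approach 1 is essentially a map step over overlapping windows of the source text. A minimal sketch of what I mean (the function names and the injected `ask` callable are mine, not from any library; `ask` stands in for whatever model call you actually use):

```python
def chunk(text, size=2000, overlap=200):
    """Split text into overlapping windows so no fact is cut in half
    at a chunk boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def map_query(chunks, query, ask):
    """Ask the same question of each chunk separately, then keep the
    non-empty answers for a final summarisation pass.
    `ask(query, chunk)` is a placeholder for your LLM call."""
    answers = [ask(query, c) for c in chunks]
    return [a for a in answers if a]
```

The overlap is the important knob: with `overlap=0` a relevant sentence straddling two chunks is invisible to both prompts, which quietly reintroduces the needle-in-a-haystack problem you were trying to avoid.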
Extremely dangerous for the one detail you won't expect, precisely because hallucination is becoming rarer; extremely dangerous in the hands of practitioners who aren't paranoid about hallucination.
(Incidentally: somebody apparently lost a house recently after having a chatbot write the contract. This is indicative of the level of carelessness users are capable of.
Edit: I am trying to find that piece of news, but it is proving nontrivial. Maybe the original reference that reported it was itself a victim of hallucination? Meanwhile, I have found this noteworthy piece: a firm lets people get contractual terms from a chatbot, the chatbot hallucinates the terms, and the firm loses the resulting legal action - https://mashable.com/article/air-canada-forced-to-refund-aft... )