Oh, I think it should be improved, for sure; I just think this is a bad example. Most of the fact checking can be done with any modern information retrieval system: you can build a loop that regenerates answers until they're consistent with the retrieved evidence, or feed the retrieval results back in to steer the answer toward correctness (a sketch of what I mean is below). We also have very powerful semantic inference engines and other tools that complement LLM output. Judging the possibilities by the beta is simplistic, and I think folks are unfairly down on the achievement by picking nits.
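
Something like this loop is what I have in mind, as a rough sketch: all the names here (generate_answer, retrieve_evidence, supported) are hypothetical stand-ins for the LLM call, the IR system, and the semantic checker, not any real API:

    # Rough sketch of a regenerate-until-verified loop. Every function
    # below is a hypothetical stand-in, not a real library call.

    def generate_answer(question: str, hints: list[str]) -> str:
        """Stand-in for the LLM; hints are retrieved passages fed back in."""
        raise NotImplementedError

    def retrieve_evidence(text: str) -> list[str]:
        """Stand-in for any modern IR system (search index, vector store, etc.)."""
        raise NotImplementedError

    def supported(answer: str, evidence: list[str]) -> bool:
        """Stand-in for a semantic check: does the evidence entail the answer?"""
        raise NotImplementedError

    def answer_with_fact_check(question: str, max_tries: int = 5) -> str:
        answer, hints = "", []
        for _ in range(max_tries):
            answer = generate_answer(question, hints)
            evidence = retrieve_evidence(answer)
            if supported(answer, evidence):
                return answer
            # Feed the evidence back in, so the next generation is
            # steered toward what the IR actually found.
            hints = evidence
        return answer  # best effort after max_tries

The retrieved evidence doubles as the hint on the next pass, which covers both strategies: regenerate until the answer checks out, and nudge the model toward correctness with what retrieval found.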