LLMs are also heavily biased once chatbot tuning has led to mode-collapse. That's why you see the same verbal tics coming out of them, like the em-dashes or the 'twist ending' in the more recent 4os. And if LLMs really were unbiased, you'd expect better scaling when you tried to bruteforce code correctness. Training a 'test LLM' will just wind up inheriting a lot of the shared blindspots. Such a model isn't independent of the implementation at all (just as humans are not independent, even when they didn't write the original and never saw it; this is why you can't simply throw _n_ programmers at a piece of code and be certain you got all the bugs, and why fuzzers will continue to rampage through code).
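To make the independence point concrete, here is a toy model, not a measurement of anything. It assumes each bug falls into a blindspot shared by every reviewer with probability `shared_blindspot`, and is otherwise missed by each reviewer independently with probability `miss`. The function name, both parameters, and the numbers are hypothetical and chosen only for illustration.

```python
def survival_probability(n, miss=0.5, shared_blindspot=0.0):
    """Chance that a bug survives n reviewers under the toy model.

    shared_blindspot: assumed probability that the bug sits in a blindspot
        common to every reviewer, so nobody catches it.
    miss: assumed probability that one reviewer misses the bug when it is
        not in a shared blindspot (misses are independent in that case).
    """
    return shared_blindspot + (1.0 - shared_blindspot) * miss ** n


if __name__ == "__main__":
    # Compare fully independent reviewers with reviewers sharing a blindspot.
    for n in (1, 5, 10, 50):
        independent = survival_probability(n)
        correlated = survival_probability(n, shared_blindspot=0.1)
        print(f"n={n:>2}  independent={independent:.6f}  correlated={correlated:.6f}")
```

Under these assumptions the independent case decays geometrically toward zero, while the correlated case can never drop below `shared_blindspot` however large _n_ gets. That floor is the part more reviewers, human or LLM, cannot buy their way past, and the part a fuzzer with different blindspots can still reach.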