Not GP, but I would imagine "another checker to scan the results" would be another NN classifier.
The thinking being that you'd compare the outputs of the two. Assuming their errors are statistically independent and the two are of similar quality, a 1% disagreement rate between them would suggest an error rate of roughly 0.5% each relative to "ground truth", since when errors are rare the two almost never err on the same item, so the disagreement rate is about twice the individual error rate.
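To make the arithmetic concrete, here's a rough sketch, not anything a real pipeline is known to use. It assumes binary labels, independent errors, and the same error rate e for both checkers, so the disagreement rate is d = 2e(1 - e). The function names are made up for illustration.

```python
import math
import random


def error_rate_from_disagreement(d: float) -> float:
    """Solve d = 2*e*(1 - e) for the smaller root e.

    Assumes two binary classifiers with independent errors and the
    same per-item error rate e (an idealisation, not a given).
    """
    if not 0.0 <= d <= 0.5:
        raise ValueError("disagreement rate must be in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * d)) / 2.0


def simulate_disagreement(e: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo check: fraction of items where exactly one checker errs."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n):
        a_wrong = rng.random() < e
        b_wrong = rng.random() < e
        # With binary labels, two wrong answers coincide, so the checkers
        # disagree only when exactly one of them is wrong.
        if a_wrong != b_wrong:
            disagreements += 1
    return disagreements / n


if __name__ == "__main__":
    print(error_rate_from_disagreement(0.01))
    print(simulate_disagreement(0.005))
```

For d = 0.01 the closed form gives an error rate just over 0.005, which is where the ~0.5% figure comes from. If the two checkers tend to fail on the same inputs, the independence assumption breaks and the real error rate can be much higher than the disagreement suggests.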
If you already have the answers to verify the LLM output against, why not just use those to begin with?