From what my results indicate, they stop the ads from running. So you wouldn't see ads that were incorrectly labeled; rather, you WOULDN'T see ads that should be running but were incorrectly flagged as needing a political affiliation label.
And from the looks of the yoga ads, there are hundreds that have been flagged/paused as needing to disclose that they are political (when they actually aren't).
Anecdotally, someone may have evidence of poor performance. Assigning a specific number to its statistical probability based on that anecdotal evidence is where the problem lies.