From the blog post: "A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details." They have a chart that breaks out results for the model with versus without "vision", i.e., having trained on the exam questions before.