That’s a great point! While we do our best to simulate an identical case, if the agent responds differently, our focus is on whether the key evaluator or goal for that replay set passes or fails. We use that as the source of truth and flag the exact moment where the conversation diverges from the expected flow.