The issue is that Jobs's training data is likely 99% his public "presentation voice" audio -- cadence, inflection, and emphasis from remarks at Apple events, commencement addresses, shareholder meetings, etc. -- which OF COURSE sounds unnatural in regular conversation.
Meanwhile, Rogan has a million hours of regular conversation audio to learn from.
Humans are expensive, though. If you have a lot of speech to record, it might be cheaper to use the human to train the AI and then let the AI handle the rest.
Would the fact that Joe's data is more standardized and produced the same way have an effect? Jobs's data is likely a mix of different volumes, echo levels, and processing.
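It probably would, which is why training pipelines typically normalize the audio before fine-tuning. A minimal sketch of that kind of preprocessing is below -- the file names, the 22.05 kHz target rate, and the thresholds are just illustrative assumptions, not anyone's actual pipeline:

    import numpy as np
    import librosa
    import soundfile as sf

    TARGET_SR = 22050  # a common sample rate for TTS training corpora

    def normalize_clip(in_path: str, out_path: str, peak: float = 0.95) -> None:
        # Load and resample so every clip shares the same sample rate.
        y, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
        # Peak-normalize to reduce volume differences between recordings.
        y = y * (peak / max(np.max(np.abs(y)), 1e-9))
        # Trim leading/trailing silence, which varies a lot across source material.
        y, _ = librosa.effects.trim(y, top_db=30)
        sf.write(out_path, y, TARGET_SR)

    # Hypothetical file names, for illustration only.
    normalize_clip("raw/jobs_keynote_01.wav", "clean/jobs_keynote_01.wav")

This only evens out loudness, sample rate, and silence; room echo and heavy broadcast processing are much harder to undo, so a corpus recorded the same way every time (like a podcast studio) still has a real advantage.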