The issue is that Jobs's training data is likely 99% his public "presentation voice" audio -- cadence, inflection, emphasis from public remarks at Apple events, commencement addresses, shareholder meetings, etc. -- which OF COURSE sounds unnatural in regular conversation.
Meanwhile Rogan has something like a million hours of regular conversation audio to learn from.
Humans are expensive though. If you have a lot of speech to record, it might be cheaper to use the human to train the AI and then let the AI finish the rest.
Would the fact that Joe's data is more standardized and produced the same way have an effect? Jobs's data is likely a mix of different volumes, echo levels, and processing.
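For what it's worth, that mismatch is usually tackled with a preprocessing pass. A minimal sketch of what that could look like, assuming librosa and soundfile are available; the sample rate, silence threshold, and RMS target here are made-up values for illustration, not anyone's actual pipeline:

```python
import librosa
import numpy as np
import soundfile as sf

def normalize_clip(path, target_sr=22050, target_rms=0.1):
    # Resample every clip to one common sample rate
    audio, _ = librosa.load(path, sr=target_sr)
    # Trim leading/trailing silence (e.g. applause gaps in keynote recordings)
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # Scale to a common RMS level so quiet interviews and loud keynotes match
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        audio = audio * (target_rms / rms)
    return audio

# Example (hypothetical file names):
# sf.write("jobs_keynote_clean.wav", normalize_clip("jobs_keynote.wav"), 22050)
```

Even with that kind of leveling, though, it only fixes loudness and silence; it can't undo room echo or the "presentation voice" delivery itself.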
Probably significantly more training data for Rogan than Jobs, and a much wider range thanks to his long-running podcast. I'm not super familiar with Steve Jobs, so I can't think of anything other than his keynotes and some interviews that you would be able to use for him.
Unrelated point... that laugh was incredibly bad and repetitive, to the point it felt like they were playing a laugh.wav file each time they wanted a laugh instead of generating a new laugh of variable pitch and length.
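Even without regenerating the laugh from scratch, the effect the commenter is asking for is cheap to fake. A hedged sketch, assuming librosa/soundfile and a hypothetical laugh.wav; the perturbation ranges are arbitrary:

```python
import random
import librosa
import soundfile as sf

def varied_laugh(path="laugh.wav", sr=22050):
    audio, _ = librosa.load(path, sr=sr)
    # Shift pitch by a random amount up to +/- 2 semitones
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=random.uniform(-2.0, 2.0))
    # Stretch or compress duration by up to ~15%
    audio = librosa.effects.time_stretch(audio, rate=random.uniform(0.85, 1.15))
    return audio

# sf.write("laugh_variant.wav", varied_laugh(), 22050)
```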
Maybe there's more training data available for Rogan. The guy pumps out hundreds of hours of content a year in which he's recorded discussing every topic under the sun. I can't imagine there's a similar quantity of recordings of Jobs's voice - or of almost anyone's voice for that matter.
Edit: four other people replied in the time it took me to type two sentences. I guess the answer is that obvious.
Presumably because we have hours, days, weeks of Joe Rogan speaking - not just on his podcast but as a sports announcer as well. Steve Jobs... we have a few speeches and presentations, but we don't have much data on how he spoke by comparison.