Look at the handbags hanging in the air next to people walking by in the background, and the six fingers on the interviewer's hand. But yes, I had to look at it repeatedly to see that it's fake. It's definitely good enough to fool most people scrolling their feed.
I don't follow it super close, but I've seen some videos from the Wan model that's now available in ComfyUI, and they can be really good.
I suspect it is a ControlNet-style generation, driven by some real underlying footage. The nonverbal communication is just too good compared to what I've seen so far from pure AI generation.
Like you said, most of it is not AI. My best guess is that somebody used AI inpainting to remove the captions and watermark. First time I've seen anything like it.
Maybe someone took a cropped video and used some AI video model to do outpainting. I've found a few different versions of it using Google Lens. Some are vertical and show more of the image vertically but are more cropped horizontally, and some have an Invideo AI watermark on them, but at different positions. I can't find any video on Invideo AI's page that comes close to these facial expressions and this lip sync, so I also think at least the face is real video.