I agree. I think the most telling sign is the hand gesture for the number 3 when he says "just invested 300 bucks". Don't think Deepfake models can understand intent yet.
1 - This is an actual video of the guy in his home, but they changed/synthesized the audio and then worked on the lip movements to make it match.
2 - This is a video of an actor impersonating the guy, possibly to the extent of impersonating his voice (although his timbre might make that a little tricky), and then they just deepfake the face onto the actor. An example of this is deeptomcruise on TikTok, something that you should treat yourself to if you haven't seen it yet - https://www.tiktok.com/@deeptomcruise (first two today aren't great, here's a good one - https://www.tiktok.com/@deeptomcruise/video/7018171271095553... ? )
I'm not even convinced there's any alteration here, but even if there was, both of the above could be possible. Adobe demoed something called VoCo in 2016 that never saw the light of day; not sure if there is something approximating this available today: https://www.youtube.com/watch?v=I3l4XLZ59iw&t=260s
Exactly, so the point is that any relevant gestures must be coming from the timing used in an existing video. Current deepfake tech can manipulate lips and facial expressions, but it can't have the video lift three fingers at the proper time when "300" is being said.
So this is either an indication of a very elaborate deepfake that managed to surface an amazingly coincidental source video (which should be possible to find in his archives), or it's not a deepfake but a real recording.
Or: the base video is well chosen from among the victim’s, the message is crafted so as not to deviate much from the base, and the deepfake is used to have him say other things.
The reason for my skepticism: state-of-the-art speech synthesis from the best this planet has to offer still requires a full phoneme library recorded in studio conditions. A voice sample taken from a user’s Instagram page doesn’t seem like the kind of source material that would be useful for making convincing speech.
I think you are right: if you find a similar person with a similar voice (even commission them for $5 on Fiverr) and only deepfake the face in a low-quality video, this is very much achievable with today's state of the art.
Pay attention to the hand. His index finger is turned inwards; that would be a very odd “number 3 gesture”. More likely it's a random hand movement from the video used.