Terrifying would be continuing to take everything we see as fact in 5 to 10 years. I think we are in a special time where if you see a video of pretty much anybody, you can usually tell right away if it's fake or not. Sometimes it's hard, but it can usually be debunked. Before that, you really couldn't fake any videos without it being obvious.
In 5 to 10 years, hopefully we will have learned to never ever ever take anything we see as fact, because we absolutely will not be able to distinguish rendered video from the real thing.
I used to think I could tell when there was CG in pretty much any movie, but it turned out there was a lot of CG that I simply didn't notice. I'm willing to bet there are movies where you won't be able to tell whether something is CG or not.
I recently found out that the majority of photos in an IKEA catalogue are in fact entirely CG. It makes perfect sense, of course, and the fact that they're doing it is unsurprising.
What did surprise me was the quality of the renderings: according to this article, not even their own QA department can tell the difference between their photos and renderings any more: http://www.cgsociety.org/index.php/CGSFeatures/CGSFeatureSpe... (so they put their photographers on a 3D modelling course, and the 3D modellers on a photography course, to blur the distinction even further -- really cool article, IMO).
True, but the downside of this is that anyone caught in the act will use it as a defense of first resort. It'll be the 2020s equivalent of 'my twitter account was hacked', at least until we establish some sort of reliable ELA (error level analysis) tool to grade source material.
Indeed, as there are more and more cameras around (including autonomous ones of increasingly tiny size) imagery of the videographer will probably become a major authentication factor.
- We have mechanisms to prove that video is taken after a certain time (show a newspaper on the video, or get the video cryptographically timestamped via a trusted server).
- We have mechanisms to prove that something was not recorded before a certain time, by doing something interactive and unpredictable with live viewers.
- We have mechanisms to detect if something has been modified from its original form (signing).
You might be able to make a CCD chip that signs every frame with a private key, and then ships the frame off to a public signing server too. Producing that CCD along with the video taken might be proof. But then you could defeat that with a display hooked to the camera, feeding the doctored image to the trusted camera.
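A minimal sketch of that signing idea, using HMAC from Python's standard library as a stand-in for the asymmetric signature a real trusted camera would use (the key, the frame bytes, and the chaining scheme are all illustrative assumptions, not any real camera's design):

```python
import hashlib
import hmac
import time

CAMERA_KEY = b"secret-key-burned-into-the-sensor"  # hypothetical device key

def sign_frame(frame_bytes: bytes, prev_digest: bytes) -> dict:
    """Sign one frame, chaining it to the previous frame's digest so
    frames can't be reordered or dropped without detection."""
    digest = hashlib.sha256(prev_digest + frame_bytes).digest()
    signature = hmac.new(CAMERA_KEY, digest, hashlib.sha256).hexdigest()
    return {"digest": digest, "signature": signature, "time": time.time()}

def verify_frame(frame_bytes: bytes, prev_digest: bytes, record: dict) -> bool:
    """Recompute the chained digest and check the camera's signature."""
    digest = hashlib.sha256(prev_digest + frame_bytes).digest()
    expected = hmac.new(CAMERA_KEY, digest, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

# Chain two frames together, starting from a fixed genesis value.
genesis = b"\x00" * 32
r1 = sign_frame(b"frame-1-pixels", genesis)
r2 = sign_frame(b"frame-2-pixels", r1["digest"])

assert verify_frame(b"frame-1-pixels", genesis, r1)
assert not verify_frame(b"frame-1-DOCTORED", genesis, r1)  # tampering detected
```

Chaining each frame to the previous digest means frames can't be silently dropped or reordered, but as noted above, none of this helps if the doctored imagery is fed to the sensor optically.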
I remember learning about Stalin's photo retouching, and reading about the systematized photo retouching and censorship in 1984. At the time I thought it was completely impractical, there's just too much work to do and not enough people to do it. Let's hear it for automation, putting the auto in Autocracy! :)
> You might be able to make a CCD chip that signs every frame with a private key, and then ships the frame off to a public signing server too. Producing that CCD along with the video taken might be proof. But then you could defeat that with a display hooked to the camera, feeding the doctored image to the trusted camera.
How does that defeat the cryptographic timestamping?
The goal is to have a video that can be used as evidence in court: what you're seeing actually happened. You can guarantee that the framebuffer recorded by the trusted camera is indeed an accurate recording of the photons that entered the lens.
But you still wouldn't be able to guarantee that the scene actually happened... if we have enough compute and good enough algorithms, you could spoof a scene in real time on a small display strapped to the front of the trusted camera. Even though the trusted camera is recording what it sees, what it sees isn't really what is happening.
So we still don't get back to being able to use video as evidence.
I believe everything you have said is true, especially with cameras becoming smaller and more common.
However, I'll still be optimistic and hope that with the increasing number of cameras, people will be less likely to engage in activities where they shouldn't. The opposite argument would be that with the increasing technology to fake such an activity, the amount of 'my twitter account was hacked' incidents will rise.
But the "activities where they shouldn't" will be defined by the powerful. It's total state or corporate control, depending on your dystopian future preference.
The solution to this problem of pools of power probably lies in deciding what is and isn't allowed using a continuous, consensus-based method. To enable that, we need realtime feeds of everything, available to all, and some sort of filtering mechanism to deliver video to the right people for evaluating consensus on whatever happens that is contestable. Today's 'powerful' are only powerful because they control the information. Whatever gets built to improve upon society needs to ensure it can't be wielded for self-gain or increased leverage over others.
Social regulation by Youtube commenters? What fresh hell is this?
Consensus is not sufficient. This is why rights-based approaches are developed. It has to be OK to live an unpopular lifestyle that's not harmful to others.
I agree, which is why I said "deliver video to the right people for evaluating consensus on whatever happens that is contestable". That gives you the right to live a way that is 'contestable' without having to pay a price for it (because you have a right to it).
We live in a majority reality. Enough force in the right place causes action.
I'm not suggesting a mechanism that operates on the 51% rule. I'm suggesting we 'elect' certain people to serve in roles that allow quick consensus to be formed when there is disagreement. Put those judgments in a safe place and you have 'laws' that can be referenced in the future.
I was thinking more along the lines of assaulting people or committing crimes. Things that could ruin your reputation or career that are generally disagreeable. I think what you are trying to say is that the powerful will determine what is right and what is wrong.
I can see this being used primarily against those in power though, so you aren't wrong.
> Things that could ruin your reputation or career that are generally disagreeable.
A deeply closeted gay man (with a wife and kids) taking his first steps out of the closet and kissing a man could have his career ruined by those things. Is that OK?
Another example of use of imagery for enforcing a minority's idea of "shouldn't": in response to Emma Watson's feminism, threats have been issued to leak nudes of her.
In a world where entirely realistic body images can be created of anyone, it becomes possible to threaten to "leak nudes" of anyone, even if they've never taken any.
> In a world where entirely realistic body images can be created of anyone, it becomes possible to threaten to "leak nudes" of anyone, even if they've never taken any.
It is also the world where nude photos of anyone have exactly zero value, because everyone who wants them can generate those themselves and there's no easy way to distinguish between real, doctored and fake ones.
I run a startup specializing in this space called the 3D Avatar Store (www.3d-avatar-store.com).
3D reconstruction of human faces is right on the edge of mainstream. I'm betting on it, personally.
Our system is similar to theirs, but more general: we laser scanned 300,000 real people and then associated each laser scan with dozens of photos of that person taken from different angles, lighting conditions and expressions. That data set was then used to train a neural net -- actually a pipeline of neural nets.
We can accept one photo and get back a good quality 3D model, a series of photos for better quality, or HD video and get back frame-by-frame, in-expression reconstructions just like their solution. In fact, our system is able to reconstruct 36 people per video feed in real time, and handle 4 video feeds at once. We don't need as much reference information as they do, because we trained our system to generally understand the human facial form, rather than operating in isolation on a single reconstruction as their solution does.
Our current system is targeted as a WebAPI for games and serious simulations - enabling 3rd parties to implement "put yourself in the game" functionality. As such we have 3 different geometry outputs aimed at game/simulation developers. We also do facial recognition, and we have a special "forensic" output for that.
Our current "best output" is purposely "Pixar-like" rather than realistic. Making them realistic tends to freak people out -- especially women (it seems our culture has trained women to have an idealized self-image, and when presented with their non-mirrored true form, they don't like it).
Very nice! Our work is different though: we create high-detail moving 3D models from YouTube videos -- without any manual interaction (it looks like there is quite a bit of interaction in the creation of an avatar on 3d-avatar-store).
Ira
http://homes.cs.washington.edu/~kemelmi/
The interaction is primarily there to support users who supply poor quality photos. Given a photo taken with an actual lens (not a mobile phone's pinhole), the manual portion can be skipped. Plus, since we only expose single-photo input (given the opportunity to supply multiple photos, most users supply multiple garbage photos), certain profile features are difficult to recover. So we have a "3D detailing" interface so people can adjust their profile and add smile creases and so on. That 3D detailing interface also allows for exaggeration -- which is how the avatar on our home page is presented.
Your work is very nice as well. Like yours, our video version requires no manual interaction. It's primarily used by government agencies, and we've not exposed it to the public yet.
And it looks like our site is under attack at the moment. If it does not load, give it a beat. We are seeing thousands of hits per second from eastern Europe right now. (Thank you Federal Reserve quality hardware firewall!)
I like that they show cases where it has problems. It's very much "here's what we can do, and here's where it doesn't work." There is no hype, no claims of "novelty", no speculation on uses, just results. I wish this were far more common.
I still would like to see the reconstruction from an angle other than the original. Depending on how much cheating you accept, it is not too hard to get a good result from that angle only.
Somewhat off-topic, but I wonder why facial recognition/modeling experts seem to persistently ignore ears and jawline. As someone who works in film and does some picture editing (though it's not my primary skillset), I can say that ears are just as individual as other parts of the face, and they're one of the trickiest things for makeup artists to work on. As CG in movies and videogames keeps improving, my suspension of disbelief is often broken by noticing problems with the ears, e.g. watching a CG anime film and noticing that everyone has the same ear shape.
I would guess that the high incident angle makes it very difficult to accurately measure. You can see some trouble with the inside and bottom of the noses, and inside the mouth. Parts that are near orthogonal to the viewing plane look much flatter.
In this case it is likely because ears can be obscured by hair. If you notice this technique has some difficulty just with shadows so they limit it to the face.
> As CG in movies and videogames keeps improving, my suspension of disbelief is often broken
I have never really seen CG portrayal of people in live action films that didn't break suspension of disbelief, it always falls right into uncanny valley stiffness. The exception being Avatar, which for whatever reason doesn't seem to have had the problems other films have.
That's a silly statement. If it didn't break suspension of disbelief then it's because you didn't even realize it was CG! If you go to movies at all then I can all but guarantee you there are CG shots that you had no idea were CG.
Here's some super impressive CG from the first Captain America movie back in 2011. Crazy, insane stuff they're doing. And that's a very, very extreme case. You'd better believe they are doing slick stuff in non-extreme cases!
http://www.fxguide.com/featured/case-study-how-to-make-a-cap...
For Iron Man it's not quite the same because it's a suit but it's just as impressive. Since Iron Man 2 there is no full suit worn by an actor. There are no legs and at this point there are barely even arms. There's a chest piece and open mask face piece and that's about it. http://movies.stackexchange.com/questions/2198/how-are-the-i...
In The Social Network there are a lot of scenes with the Winklevoss twins. Spoiler: They didn't use twin actors. They used two actors and CG'd the face of one onto the other body. No way in hell did you call this out on first viewing. (number 3) http://www.totalfilm.com/features/50-cgi-scenes-you-didn-t-n...
I think we are talking about CGI faces here, and that's just not doable yet; this research won't change a thing about it.
The last movie I remember having a CG face was Clu in Tron Legacy, which certainly was state-of-the-art CG; unfortunately, 'state of the art'[1] in face modeling/animation is just not perfect yet.
That link is real-time rendering. State-of-the-art server-farm rendering is orders of magnitude better.
> The last movie I remember having a CG face was Clu

By strict definition, if you remember it having a CG face then it wasn't good. It's not possible to remember a good CG face, because if it's good you wouldn't know it was CG!
The next big test might be Paul Walker in Fast and Furious 7 or Philip Seymour Hoffman in Hunger Games 3 -- both big-budget movies where the actor died and CG will be used in at least some places. That's the ultimate test. I think we can already get away with CG if the actor isn't known, but for actors whose faces we know and whose mannerisms we subconsciously recognize, it's another step up in difficulty.
> which for whatever reason doesn't seem to have had the problems other films have.
By not CGI-ing humans. That's enough for you to stop judging what you're seeing by the same criteria as you do real life. They wisely chose to stay well to the left-side of uncanny valley instead of trying to jump over to the right-side.
I would wager that this is because most of Avatar is not live-action, but CG, with real people added in. Spend enough money/time and live-action people fit in really well to CG environments.
It took me until about the third viewing before I started noticing some small details. There are a couple of scenes where the main characters are touching each other, and the arm/hand movement has a weird puppet-like quality that I can't unsee now.
3D reconstruction is used in state-of-the-art facial recognition as well.[1] Essentially you reconstruct the face in 3D, rotate the 3D model to the front, project it back into 2 dimensions, and then feed it through a CNN with deep architecture. Because this gives you very good alignment, you can do tricks like not having shared weights across the entire image. That is, each section of the input vector is known to correspond to a certain part of the face and thus can learn unique parameters that are well suited for that specific region.
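The "unshared weights" trick can be sketched with a toy locally connected layer: unlike a convolution, every output position has its own filter, which makes sense only because frontalization guarantees that position i always covers the same facial region. (Shapes and names below are illustrative, not from the cited system.)

```python
import numpy as np

def locally_connected_1d(x, weights, window):
    """Apply a different filter at each position (no weight sharing).

    x:       input vector, shape (n,)
    weights: one filter per output position, shape (n - window + 1, window)
    """
    n_out = len(x) - window + 1
    assert weights.shape == (n_out, window)
    # Each output position i uses its own weights[i], unlike a convolution,
    # which would reuse a single filter at every position.
    return np.array([x[i:i + window] @ weights[i] for i in range(n_out)])

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=(6, 3))   # 6 positions, each with its own 3-tap filter
y = locally_connected_1d(x, w, 3)
assert y.shape == (6,)
```

A convolution would be the special case where all rows of `weights` are identical; dropping that constraint lets, say, the eye region learn different parameters than the mouth region.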
The paper claims that it takes about 105 seconds to render a single frame. So one second of 30 fps video would take about 52 minutes to render. I would have to read more in depth to see what kind of savings can be had by sharing information across frames. (The paper also doesn't mention the use of GPU acceleration.)
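For reference, the arithmetic behind that estimate (using the paper's 105 s/frame figure and assuming 30 fps):

```python
seconds_per_frame = 105   # per-frame fitting time reported in the paper
fps = 30

# Total render time for one second of footage, in minutes.
minutes_per_video_second = seconds_per_frame * fps / 60
print(minutes_per_video_second)  # 52.5
```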
Roughly, they use a model of the face, render it, compare the source image and the rendered image, estimate changes to the model's position, orientation and deformation that will yield a better match, and then repeat this until the result is good enough. While you can probably exploit temporal coherence between frames, the process is inherently expensive due to its iterative nature. But it may also be relatively easy to parallelize because of this.
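That render/compare/update loop (analysis by synthesis) can be sketched on a toy one-parameter "renderer". The real system optimizes pose, identity and expression parameters against image-space energy terms, so everything below is a deliberately simplified stand-in:

```python
def fit_model(source, render, initial_params, lr=0.1, steps=200, eps=1e-4):
    """Analysis by synthesis: render the model, compare with the source,
    nudge the parameters toward a better match, repeat."""
    params = initial_params
    for _ in range(steps):
        residual = render(params) - source                     # compare
        # finite-difference estimate of d(residual^2)/d(params)
        grad = 2 * residual * (render(params + eps) - render(params)) / eps
        params -= lr * grad                                    # update
        if residual ** 2 < 1e-10:                              # good enough
            break
    return params

# Toy example: recover the parameter that produced an observed "rendering".
render = lambda theta: 3.0 * theta + 1.0   # stand-in for a real renderer
true_theta = 0.7
fitted = fit_model(render(true_theta), render, initial_params=0.0)
assert abs(fitted - true_theta) < 1e-3
```

The iterative structure is why per-frame cost is high, and also why many frames (or many pixels within the comparison) can be processed in parallel.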
I'd like to ask the authors how they managed to do such great/natural looking reconstructions of the eyes. Eyes are tough because they're naturally specular, transparent in places, and refractive.
After quickly reading the paper, I think they did nothing special for the eyes; it's probably just the result of the final refinement step. The result loses a bit of its magic, though, when you read in more detail how it works. But it remains impressive.
I think I'm missing the point. Why are all the reconstructed videos from the same angle? It would demonstrate it better if they repositioned the camera.
> We note that our method does not guarantee an accurate fit to ground-truth geometry, as the shape of the face may change in each frame and single-image cues are not sufficient for this purpose. Rather, we seek to produce a reasonably convincing model (leveraging all available imagery) which optimally fits the image information in each frame.
This makes it seem like the mesh produced for each frame is specially deformed to render optimally for that camera angle, and it wouldn't necessarily look perfect from other angles.
I've been increasingly interested in the face. Human beings must have some incredible mental calculations going on when parsing a face; we're an evolved species that uses the face as a form of communication.
I love the video attached to the link because it isolates the face perfectly. If you look closely you can see tiny, minute movements combining within the face as each person talks: the eyes shifting, the face rotating, looking in various directions, the forehead crunching, the eyebrows raising, smiling, etc. All of these "cues" combine to create a message that we interpret instantly.
The face has inspired me lately to read more into this subject, as facial recognition seems, at least on the surface, to be an extremely complex innate human ability.
It really is incredible. I mean, in the original link, look at the image of Arnold used for the video before playing. It's a blurry, greyscale section of his face. Nonetheless, most people could easily recognize that face as being him.
Billions of people in this world, we all have a very similar facial structure with two eyes, a nose and a mouth, and yet you can recognize that small blurry face in a fraction of a second.
I imagine it's similar with animals. You can have thousands of birds in a flock, and they can recognize their mate instantly. To us, we'd have to carefully analyze the birds for days or weeks to make that same match.
What's amazing is that scientists discovered that the brain allocates a single neuron that fires for each recognized face. For example, you have one special neuron in your brain that fires when you see Bill Clinton's face, and that exact neuron fires no matter which picture of Bill Clinton you happen to see.
> What's amazing is that scientists discovered that the brain allocates a single neuron that fires for each recognized face.
This is nonsense, citation needed.
> For example, you have one special neuron in your brain that fires when you see Bill Clinton's face.
This is New Age science, i.e. not science at all. It's a myth. One neuron is equivalent to one bit in a computer. One bit of information is insufficient to distinguish between faces.
> And that exact neuron fires no matter which picture of Bill Clinton you happen to see.
Maybe you could learn a tiny bit of neuroscience before spreading this kind of nonsense.
You need to learn some neuroscience, buddy. One neuron is not one bit; a neuron works in a much more complex way than a single bit in a computer. A neuron has hundreds to thousands of connections via its axon and dendrites, and connects with other neurons in a dense network. I am surprised that, given your knowledge of neuroscience, you're attacking me. Very strange.
EDIT: As dragonwriter says below, lower levels of the neural network are responsible for general facial recognition, but they trigger more specific neurons once recognition proceeds from generic face to specific person. Even in artificial neural networks, a single unit is sufficient at the end of the classification.
In that case, your original claim was false -- either one neuron is acting alone as you claimed, or it isn't. Your claim was that one neuron was acting alone, which is absurd.
A neuron is interconnected to many others, but this doesn't mean one neuron is many neurons, any more than one binary bit is 64 binary bits by virtue of its position in a binary word.
The link is posted above. It is not nonsense. The research was done at Caltech by the neuroscientist Christof Koch, one of the top neuroscientists in the world.
I am not sure why the link was down voted. But here is the excerpt from the wiki page:
> In 2005, a UCLA and Caltech study found evidence of different cells that fire in response to particular people, such as Bill Clinton or Jennifer Aniston. A neuron for Halle Berry, for example, might respond "to the concept, the abstract entity, of Halle Berry", and would fire not only for images of Halle Berry, but also to the actual name "Halle Berry".
> However, there is no suggestion in that study that only the cell being monitored responded to that concept, nor was it suggested that no other actress would cause that cell to respond (although several other presented images of actresses did not cause it to respond). The researchers believe that they have found evidence for sparseness, rather than for grandmother cells.
This can help a lot in recreating the faces of avatars in animated movies and games. They have a tough time tracking facial details using small markers.
I was wondering whether shadow and shine removal could solve the issues shown at the end. An example, described as shadow correction, is implemented by these autonomous car designers:
http://www.igvc.org/design/2013/US%20Naval%20Academy.pdf
This kind of advancement is one of the reasons I don't post photos of myself online. In a few years we will be capable of making videos with anybody's face replaced with anybody else's. It will be trivial to produce a fake video that can cause all kinds of legal trouble.
And yes, I realize it's even possible now, but with all the new algos and software coming out it will be easy enough for somebody to just mess with people's lives for fun.
It doesn't matter if you post photos of yourself online; very likely others will do so. And your image is being digitally captured constantly: if you travel on public transport, go through an airport, or do anything in a public place, your photo has been captured many times. Figuring out that it's you is also pretty straightforward. All it takes is a DMV hack, or Facebook (where one of your coworkers tagged you in a photo), or Instagram (where your niece tagged you by the bbq on July 4th), etc.
Stroll through a few blocks of midtown Manhattan and you'll end up on hundreds of security cameras and in the backgrounds of a handful of tourists' photos. Add a few more years and your image and gait will be continuously recorded by a swarm of 3D camera phones, self-driving vehicles, and police drones. Authenticating a crime may require not one but a diverse set of recordings from multiple entities. Validating that the person you are talking to on Skype is indeed your mom may require a cryptographic key exchange (and for a damn good reason it won't be Skype, but a tool that can be verified by third parties).
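That kind of key exchange could, in principle, be bootstrapped with something like textbook Diffie-Hellman. The parameters below are toy values for illustration only; real systems use standardized large groups or elliptic curves, plus authentication to stop a man in the middle:

```python
import secrets

# Toy modulus (the Mersenne prime 2**127 - 1) and generator; fine for
# illustration, far too small and unvetted for real use.
P = 2**127 - 1
G = 5

def dh_keypair():
    """Generate a private exponent and the corresponding public value."""
    priv = secrets.randbelow(P - 2) + 1
    pub = pow(G, priv, P)
    return priv, pub

# "You" and "mom" each generate a keypair and exchange only the public parts.
a_priv, a_pub = dh_keypair()
b_priv, b_pub = dh_keypair()

# Both sides derive the same shared secret without it ever crossing the wire.
shared_a = pow(b_pub, a_priv, P)
shared_b = pow(a_pub, b_priv, P)
assert shared_a == shared_b
```

Verifying that it's really mom additionally requires tying her public value to her identity -- exactly the part a third-party-verifiable tool would need to get right.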
For those of us who grew up reading Neuromancer, Snow Crash, and other cyberpunk yarns: that is today, we are living in that world at this moment.
While I think this is a derangement of society, I can imagine a porn app that lets you drop in a Facebook id to superimpose your crush's likeness on a porn actor.
You assume the legal framework will keep pace with technology with no lag.
I trust that, due to the jurisprudence model, the legal framework will always lag[1] behind game-changing technology once it becomes mainstream... sometimes significantly.
[1] disclaimer: active legislation efforts can move faster than tech - especially due to lobbying.
In theory it doesn't have to - there are already legal procedures for qualifying and admitting expert testimony at trial that work quite well (see http://en.wikipedia.org/wiki/Daubert_standard). So if you were in jeopardy primarily due to video evidence which you knew to be faked then you could bring a machine vision or CG expert into court to demonstrate the unreliability of the video evidence.
But this is not to say that our system for handling forensic evidence works great -- sadly, it doesn't, and part of the problem is that juries are often reduced to weighing the credibility of competing expert witnesses without rigorous standards of measurement or terminology. You'll probably be interested in this: http://www8.nationalacademies.org/onpinews/newsitem.aspx?Rec...
It's a shame they removed the spoken audio from the demo video. It would have been much easier to judge the visual accuracy if the reconstructed lip motion were combined with the original sound.
The source videos they used as examples are about 600 frames each, which is ~40 seconds of video. For a politician or professional actor this would be easy to achieve. Likewise if you wanted a good 3d model of yourself or a friend you could do it cheaply using this method.
I thought that too, but then they have some ambiguous sentences like 'In this paper we target high detail reconstruction from a single video captured in the wild, i.e., under uncontrolled imaging conditions,' where it seems as if they treat that as their input library. Other parts of the paper support your interpretation. It's annoyingly vague.
...but still, outstanding work, even if it requires manual calibration or curation of the input set.
Quite dangerous what this technology will mean in the future: they could manufacture evidence against you, publish it, and then let the masses lynch you.
I think this is what happened with the recent decapitation videos: they were reconstructed from home videos.
IMO the videos with the people dead on the floor are real, but the videos where they talk are staged.
Remember, Osama bin Laden appeared and disappeared according to US army interests at the time, finally dying in very strange circumstances (and being buried at sea, not letting anyone else internationally confirm by DNA that he was Osama).
For me it is staged because current technology can synthesize a voice only when there are no strong emotions. The same goes for the face.
With strong emotions it becomes very easy for family and friends to notice, as people make specific gestures, most of which are not recorded on video.
That people could be perfectly calm before dying I can understand, but not while saying exactly what their captors want.
Also, before the videos, most of the population in the UK did not want to go to war; after the videos (with a UK native), most of them supported war. Cui prodest?
It's terrifying to think that in the next 5 to 10 years we won't be able to distinguish a forged, high-definition video of pretty much anybody from the real thing.