First, my earlier comments on a "competing" approach from Facebook may help give relevant context for how to think about these numbers: https://news.ycombinator.com/item?id=7393378
Briefly skimming through this paper, it appears that these numbers are not a fair comparison: this paper uses the unrestricted protocol of LFW[1], whereas the other methods in the ROC curve shown in the paper use the restricted protocol. As you might imagine, the latter is more restrictive -- specifically in the amount of training data allowed. And as I mentioned in my previous comment, training data is king in these kinds of systems -- more is always better.
To go slightly out on a limb, I suspect the use of many different types of datasets for training matters more than the new theoretical model proposed in this paper. (Significantly more data >> more complicated models, most of the time.) But I'd have to read the paper much more carefully to be sure about this.
What's the point in limiting yourself to small datasets? It forces you to be "clever" about preprocessing, because the degrees of freedom in your learning algorithm must be limited to match the size of the dataset. Being clever like this is precisely what we're trying to avoid with machine learning algorithms. It's better to just shove the raw data into a very general algorithm like a neural network and let the data do the configuration. And to do that you need _lots_ of data.
There is a tension between an algorithm's maximum performance and how little data it needs to learn. Restricting ourselves to small amounts of data means we get algorithms with a lower performance ceiling than we otherwise would, and these are then claimed to be better than algorithms which need more data to generalize well.
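To make that trade-off concrete, here is a toy sketch (synthetic data and arbitrary model sizes; none of this comes from the paper): train a kernel SVM and a larger neural net on increasing slices of the same data and watch which one benefits from having more of it.

    # Toy sketch of the capacity-vs-data trade-off (synthetic data,
    # arbitrary model sizes -- only the trend is the point).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=20000, n_features=50,
                               n_informative=25, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=5000,
                                              random_state=0)

    for n in (100, 1000, 10000):
        svm = SVC(kernel="rbf").fit(X_tr[:n], y_tr[:n])
        mlp = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500,
                            random_state=0).fit(X_tr[:n], y_tr[:n])
        # The low-capacity model usually looks fine early on; the
        # high-capacity one needs the larger n to pay off.
        print(n, round(svm.score(X_te, y_te), 3),
              round(mlp.score(X_te, y_te), 3))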
But I agree with you in general. I suspect even LFW's creators might; I don't know if they expected the benchmark to still be tested on this many years after creation.
I've always felt that all benchmarks in vision should come with expiration dates, because eventually everyone starts implicitly overfitting to them.
Also, one huge reason why we might "limit ourselves to small datasets" is because that's the only way we can compare many algorithms.
Outside training data has a huge influence on algorithm accuracy. For example, since Facebook has access to bajillions of face images that they could use to train their classifier, and since they can't share those images with the rest of the community, it's unclear whether their excellent performance is because they have gobs of data or whether it's really a better algorithm. I bet that a simple algorithm like a naive SVM might do leagues better if we could train it on Facebook's (hidden!) dataset and test on LFW. It just isn't reproducible---a measuring stick is only meaningful if everyone uses the same measuring stick.
Aren't small datasets the reason for using SVMs in the first place? You are, after all, operating in a linear space after the kernel trick has been applied -- far less expressive than a neural network.
Well, isn't it a bit awkward when dataset size is what forces you to select a certain algorithm that you otherwise wouldn't use? "Gee, I would love to train a neural network, but it's just so much data and I'm on a deadline; maybe I should just use an SVM and hope for the best ... ..."
That's one of the big surprises behind deep learning these days: it's now feasible to do things like "train a big neural network on bunches of images" in a sensible time. It's an optimization thing as much as it is a machine learning thing, in my opinion.
I didn't mean that training time is the constraint. It's that when training a more general algorithm (an ANN), your hypothesis space has more dimensions than when training a more specialized one (an SVM). You therefore need more data to train an ANN than an SVM. The reason for choosing an SVM is not that your dataset is too big for an ANN; it's that it's too small for one.
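Back-of-the-envelope version of what I mean (all numbers made up for illustration):

    # Rough sketch: free parameters as a crude proxy for hypothesis-space
    # size. Every number here is made up for illustration.
    d = 150 * 150          # input dimension, e.g. a cropped face image
    h1, h2 = 512, 512      # hidden layer widths (arbitrary)

    mlp_params = d * h1 + h1 + h1 * h2 + h2 + h2 * 1 + 1
    print("small MLP weights:", mlp_params)        # ~11.8 million

    # A kernel SVM's decision function has at most one dual coefficient
    # per training example (plus a bias), so its effective complexity
    # scales with n rather than with a huge fixed weight vector.
    n = 3000               # e.g. a small, LFW-restricted-scale training set
    print("max SVM dual coefficients:", n + 1)

With millions of weights and only a few thousand labeled examples, the data simply can't pin the network down.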
That's a great point, and it's exactly why the LFW results page makes a clear, distinct separation between algorithms that use outside training data and algorithms that do not: http://vis-www.cs.umass.edu/lfw/results.html#notesoutside You can probably guess which ones generally do better.
(Note that this is separate from the "unrestricted" vs "restricted" issue)
I think they used the correct ROC curve. Actually, it looks like they cherry-picked a few results from the Commercial and the Academic categories, but everything is from the unrestricted class.
We should wait until Erik Learned-Miller lists this on the LFW results page. He'll know how to interpret their results.
The original url [1] was blogspam—that is, it was a knock-off (or excerpt) of some other, more original source. In such cases HN strongly prefers the original source.
Submitters: blogspam is usually easy to recognize. Please check for that and post the original instead.
Had he tried to post the original (perhaps he did) it would have been denied as a duplicate. What do you suggest is the right thing to do in that case?
You can modify the URL slightly. The duplicate detector is deliberately left porous to allow the best articles multiple cracks at the bat. A small number of reposts is ok. Reposting the same thing over and over, however, is not ok, nor is deleting and reposting.
If you read it carefully, there's a caveat about how this particular dataset (recognition of /cropped/ pictures of /unfamiliar/ persons) has relatively low human accuracy (97.5% as opposed to 99.2%), because humans also use other features. That is, there's more work to be done, and facial recognition isn't a "solved" problem yet.
All the same, very impressive work. Congratulations to the authors on achieving such an important milestone.
It's true that the paper is the original source, and it's certainly ok to post those. But a user pointed out recently [1] that many HN users might have time to read a good general-interest article but not the paper itself—and usually the paper is referenced in the thread for those who want it. So either is ok, but if you're going to post a general-interest article, try to make it the most substantive one out there.
A note -- if you're linking to arXiv, it's better to link to the abstract (http://arxiv.org/abs/1404.3840) rather than the PDF. From the abstract, one can easily click through to the PDF; not so the reverse. And the abstract allows one to do things like see different versions of the paper, search for other things by the same authors, etc.
Really? The PDF seems somewhat broken: of its 24 pages, the first 12 are the proper paper and the next 12 appear to be from a slightly older version.
Face recognition is one of those technologies that seems neat at a glance and mind-bogglingly terrifying on closer inspection. It has the potential to sci-fi the world overnight, and it could do it tomorrow night. The algorithm accuracy and enormous comparison DBs are already here.
The effect this can have on commerce, advertising, policing, crime, culture, or a bunch of other things has enough wide reaching effects for a sci fi thriller.
A camera in cahoots with a till in a supermarket could put a face and a name on every purchase. If the camera and the till are also in cahoots with an advertising billboard in a shopping mall, you have created an offline version of conversion tracking.
Since the supermarket and the billboard company are in cahoots, they can compare notes and find a billboard location that gets the supermarket's best customers. If a camera in cahoots with Facebook sees you checking out climbing gear, that store can keep pitching outdoor activity products to you on Facebook. Hello, offline retargeting.
That's just advertising. Imagine policing. Imagine high school.
> It has the potential to sci-fi the world overnight and it could do it tomorrow night.
Commerce and government already use face recognition and other machine vision capabilities like license plate recognition to track you.
Simple, cheap face recognition would just level the playing field: Was the cop approaching your car written up in /r/bad_cop_no_donut? Is the bureaucrat you are dealing with known to be obstinate?
I'm a computer vision grad student. A few things concern me about this work. Maybe they're incidental, but I'm not ready to throw my hands up in the air quite yet.
- Why wasn't this accepted to CVPR/ECCV/one of the well-established computer vision conferences? I would love to read some of the reviewers' comments about this work before I give further judgment. (If this really is some CVPR preprint, or if it actually is peer-reviewed, I'd feel much better about this.)
- Why isn't this work listed on the official curated "LFW Results" page that Erik Learned-Miller maintains? http://vis-www.cs.umass.edu/lfw/results.html Is this work so new that Erik hasn't had time to review it yet?
- Human performance on LFW is 99.2%, which is higher than what the authors think it is. The performance drops to the (claimed) 97% when we only show humans a tight crop of the face: http://www1.cs.columbia.edu/CAVE/publications/pdfs/Kumar_ICC... They discuss this difference in a paragraph in their conclusion, but I consider it dishonest to use the lower number in the abstract and imply it in the title. In fact, I consider it misleading to put "Surpassing human performance" in the title to begin with, but that's another matter :)
- Showing good performance on one dataset (LFW) is certainly not enough to show that this "outperforms humans" in the general case. Getting a state-of-the-art result on LFW these days is like squeezing a drop of water out of a rock; in my opinion, we should turn our attention to harder datasets like GBU now that these "easier" ones are solved.
I'm not terribly familiar with Gaussian processes so I'm not sure whether the math works out, but it is a pretty uncommon thing to try in this domain. (Perhaps that's what makes this work interesting, especially since this year seems to be the "Deep Learning is Eating Everyone's Lunch" year)
I also wish they had described what final-stage classifier they use for the "GaussianFace as Feature Extractor" model. Often, that's the most important step. It's strange that they didn't compare with POOF, high-dimensional LBP, Face++'s deep-learned features, or any of the other state-of-the-art feature extractors, especially considering how much worse "GaussianFace as a binary classifier" does (93% vs 97% is a huge difference on this dataset).
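For context, a typical "feature extractor plus final-stage classifier" verification setup looks roughly like the sketch below; extract_features is a hypothetical stand-in here, not the paper's actual pipeline.

    # Generic "feature extractor -> binary classifier" verification sketch.
    # extract_features is a placeholder for whatever descriptor a paper
    # actually uses (high-dimensional LBP, a deep embedding, etc.).
    import numpy as np
    from sklearn.svm import LinearSVC

    def extract_features(image):
        return image.reshape(-1).astype(np.float32)   # placeholder

    def pair_representation(img_a, img_b):
        fa, fb = extract_features(img_a), extract_features(img_b)
        # One common choice: element-wise absolute difference; other
        # papers concatenate or multiply the two descriptors instead.
        return np.abs(fa - fb)

    def train_verifier(pairs, labels):
        # pairs: list of (img_a, img_b); labels: 1 = same person, 0 = not
        X = np.stack([pair_representation(a, b) for a, b in pairs])
        return LinearSVC(C=1.0).fit(X, labels)

The choice of that last classifier (and of the pair representation) can swing the final number noticeably, which is why I'd like to see it spelled out.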
Just my two cents. It definitely demands further exploration. I don't see any obvious mistakes, but I'm not sure why their approach works as well as they claim it does either.
Edit: I don't mean to start a witch hunt or anything, but if the authors have the guts to put "Human-level performance" in their title, they're just begging for the community to inspect every detail of their work and point out every flaw in its minutiae. It's our community's hot button. It's similar to the old adage about how, if you want a Linux user to help you, you have to tell them how much Linux sucks. That's where much of my skepticism comes from. The most astounding papers are often the most humble, but "humble" certainly doesn't describe this work.
In some sense, the 97% is indeed the more fair number to compare against, assuming that this paper also restricts the algorithms to only see tight crops of the face.
Obviously using more than just tight crops will give you more information, but our point in differentiating those cases when measuring human performance was that the way LFW was constructed gives you MUCH more information when using loose crops than "normal" images would. For example, many images of actors are at award shows (the Oscars in particular), and so if you see that kind of background in a pair of images, you can just say "same" and have a very good chance of getting it right. That's what the "inverse crop" experiment shows [1] -- when you block out the face in LFW images, you can still get 94% accuracy!
In normal images, however, the background won't normally give you so much information.
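(For anyone wondering what the "inverse crop" looks like in practice, it's roughly this -- a sketch of the idea with placeholder coordinates, not the exact setup we used:)

    # Rough idea of an "inverse crop": blank out the face region and keep
    # only the background/context. Coordinates are placeholders; the real
    # experiment used the dataset's own crop geometry.
    import numpy as np

    def inverse_crop(image, face_box):
        x0, y0, x1, y1 = face_box
        masked = image.copy()
        masked[y0:y1, x0:x1] = 0      # hide the face itself
        return masked

    # Training and testing a verifier on pairs of masked images is what
    # yields the ~94% figure mentioned above -- background alone.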
I do feel somewhat bad that our human verification performance experiment numbers are now being used to create linkbait titles like, "computer algorithms can beat humans," because that's obviously not true (nor have I ever believed it), but in my defense, in 2009 we didn't really think about what the press would do in 2014 when algorithms started saturating on LFW =)
I am not in the area, but Xiaoou is not a nobody in it. He used to work at (or visit) MS Research Asia and the Chinese Academy of Sciences. He has a best paper award from CVPR and has served as a chair of ICCV. That all being said, we all know how research goes... I am just curious why they put it on arXiv. Is this a common practice in the CV area, or is it because someone else is working on a similar approach?
gcr makes a number of excellent points here. I completely agree about the human performance comments. Human performance on LFW is 99.2%. While some people claim this is "not fair" because it includes some background of the picture, this is the way the benchmark has been defined. There is no universally accepted way to define the face verification problem, and when we built LFW we chose to define the task on these images (without cropping them closely). Not surprisingly human performance goes down when the images are cropped more closely to the face. My response to people who say that it is unfair to use the whole image is "go ahead and write a program which takes advantage of the whole image!"
I have not read the paper carefully yet, so I cannot comment on the technical content. However, people have raised the question of why it's not yet on the LFW results page. First of all, we don't put up results until we receive a results file from the authors, and we haven't received such a file yet, so of course the results are not up. Second, even when we do receive a results file from an author, we require that the paper either be published in a peer-reviewed conference or journal or, alternatively, if it is commercial, that an executable be available so that the system can be tried by other experimenters. If we are sent results from a paper that has been accepted but not yet published, we sometimes show the results on our results page, but highlight them in red to show that the paper is forthcoming.
Additionally, Gary Huang, who does a fantastic job maintaining the LFW site, is redoing the entire results page to better present the information, and so results may experience a short delay while we are working on the new page.
One interesting thing to note about methods that are getting very high accuracy rates on LFW is that they typically use enormous amounts of outside training data (like the recent DeepFace method by Facebook researchers). Part of the purpose of our reorganization of the results page is to better highlight the differences between methods that use outside data and those that use only the LFW training data.
The community should remember that a K-nearest neighbor classifier is the best possible classifier for any supervised learning problem when you have enough training data (as proven by Stone, 1977). Thus, it shouldn't be too surprising that many different methods work very well when given massive training sets. We saw the same thing back in the 90's with the MNIST data set.
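To make that intuition concrete, here's a toy sketch on synthetic data (nothing face-specific; only the trend matters): a plain k-nearest-neighbor classifier keeps improving as it is handed more training examples.

    # Toy illustration of why "enough data" makes simple methods strong:
    # k-NN accuracy climbs steadily with training-set size. Synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=60000, n_features=20,
                               n_informative=10, random_state=0)
    X_test, y_test = X[50000:], y[50000:]

    for n in (100, 1000, 10000, 50000):
        knn = KNeighborsClassifier(n_neighbors=15).fit(X[:n], y[:n])
        print(n, round(knn.score(X_test, y_test), 3))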
I agree with gcr that moving to new data sets makes sense, although performance on LFW with limited training data may continue to be interesting for some time.
Since you work in the field, could you please recommend some free dataset for facial recognition?
I was trying to train an LBP classifier (mostly for fun), but I was disappointed to find out that I couldn't get my hands on some large dataset.
In particular, I was hoping to find not only a training dataset with the cropped frontal faces, but possibly also side faces, and then a large benchmark dataset that I could use as well to test the recognition accuracy.
It makes me mad to think that such datasets come basically for >free< for companies such as Facebook, but I couldn't get my hands on a free one that I could use to improve the OpenCV classifier...
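For reference, what I've been doing so far is roughly this (it needs opencv-contrib-python, and the faces/<person>/<image>.jpg layout is just my own convention):

    # Minimal LBPH face recognizer training sketch with OpenCV.
    import os
    import cv2
    import numpy as np

    images, labels = [], []
    root = "faces"
    for label, person in enumerate(sorted(os.listdir(root))):
        for fname in os.listdir(os.path.join(root, person)):
            img = cv2.imread(os.path.join(root, person, fname),
                             cv2.IMREAD_GRAYSCALE)
            if img is not None:
                images.append(cv2.resize(img, (100, 100)))
                labels.append(label)

    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(images, np.array(labels))

    # predict() returns (label, confidence); lower confidence = closer match
    print(recognizer.predict(images[0]))

It trains fine; the problem is simply that I don't have enough (or varied enough) faces to feed it.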
That is a great point. Some of this is due to privacy issues -- since my school's IRB considers "face recognition" to be "human experimentation," anyone who collects these datasets in a laboratory setting has to go to the trouble of ensuring that the subjects in the datasets are treated fairly. Part of that means vetting who they give the dataset to. Sometimes there's a bunch of red tape and silly rules about "When you show people dataset samples, you can only use these images... ..." Many of the high-quality, pose-controlled sets require you to physically write to some student and enter into a license agreement with that institution, even though the dataset might be "free" to access.
There are other datasets from around the web that are much easier to get your hands on. LFW is honestly a great way to start, since it's freely available and since the "Results" page provides links to many state-of-the-art papers in the field (with PDF links for most of them, too!)
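(scikit-learn will even download LFW for you, if that lowers the barrier -- something like:)

    # LFW is one function call away via scikit-learn.
    from sklearn.datasets import fetch_lfw_people, fetch_lfw_pairs

    # Identification-style data: one label per person.
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.5)
    print(people.images.shape, len(people.target_names))

    # Verification pairs, matching the benchmark's "same / not same" setup.
    pairs = fetch_lfw_pairs(subset="train")
    print(pairs.pairs.shape, pairs.target[:10])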
If you do want something pose-controlled and high quality, you might just have to bite the bullet and write to the curators of "Multi-PIE" or "PIE" at CMU.
Huzzah! Someone else recognizes GBU. I've been working with the FOCS dataset for some time, and wow is it difficult. My best result (using only the ocular region) is around 18% EER (note: with EER, lower is better; my rank-1 identification rate is easily 90%+, but that's not a good measure of system performance IMO). Unfortunately it doesn't have the first-name basis that LFW does...
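(In case EER is unfamiliar: it's the operating point where the false-accept and false-reject rates are equal. One quick way to get it from verification scores -- the numbers below are obviously placeholders:)

    # Equal error rate (EER) from verification scores; toy inputs only.
    import numpy as np
    from sklearn.metrics import roc_curve

    labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = same person
    scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.55, 0.2])

    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER sits where the false-positive and false-negative rates cross.
    i = np.nanargmin(np.abs(fpr - fnr))
    print("EER:", (fpr[i] + fnr[i]) / 2)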
For what it's worth, the graph (Figure 4) on page 10 features the label "Human, cropped 97.53%" and later again in the comparison section but not in the abstract. Still, the whole thing raises some red flags!
I'll have to run this by my friend who writes morphometrics algorithms, as I can't actually tell what is new about this paper. This might actually allow for a proper photo-matching search engine; all the ones that I have tried to this point have been lacking or broken...
Believe it or not, LFW does not include a whole lot of rotation variance in its images. One reason for this is because LFW was originally selected by an automatic face detector in the first place, which doesn't detect rotated faces very well. Here are some sample images from LFW: http://vis-www.cs.umass.edu/lfw/number_11.html In almost all of them, the person is usually facing the camera (no profile shots!) and the in-plane rotation is not very large (their eyes are parallel to the horizon; they're not upside-down)
The fact that we aren't shooting for rotation-invariance is a problem with LFW, not necessarily this algorithm, but you're right to say that the authors' paper would be much more convincing if they tried their approach on several different datasets.
(In particular, the "Multi-PIE" dataset aims to specifically stress rotation-invariance, even though it's not quite as popular as LFW. These guys do use Multi-PIE as part of their training and they do in fact measure their algorithm's performance on Multi-PIE; see page 24. They don't seem to talk about it much since everyone is crazy for LFW and since Multi-PIE is "easy" for other reasons...)
Everyone took the same courses and is more or less familiar with the basic algorithms. The task of recognizing passport-like pictures is nearly trivial, while "random, on the street" face recognition is a different problem.
People use environmental and contextual cues (so does Facebook's system), and they are mostly guessing rather than "exact matching" people.
Anyway, as long as we're not dealing with passport photos, the task of face recognition has almost nothing to do with basic NN algorithms and a lot to do with context and cues.
Exactly. That's why we should be (IMO) focusing on hard face recognition problems, full of rotation and occlusion and blur and all the "icky" parts of the real world that we don't like to deal with.
There is some of that work going on out there, but I would be very surprised to find that spirit in a paper with the words "LFW" in the title. It's just a different focus.
In its time, LFW was "hard", since most of the face datasets (FERET, AT&T) were completely controlled: the subject visited a laboratory, sat down in the best pose, made the best facial expression, and had their picture taken by the best camera that grant funding could buy. The "in the wild" part of "Labeled Faces in the Wild" alludes to the fact that these images are more real-world than what the community was used to.
It's time to continue that train of thought and move on to something harder though.
[1] http://vis-www.cs.umass.edu/lfw/results.html