
Key sentence - "the correct choice of face space axes is critical for achieving a simple explanation of face cells’ responses."

They did PCA over two sets of metrics, taking the top 25 components from each set and combining them into a 50-d space. Fitting a model from these dimensions to the measured responses explained 57% of the variance in real cell firing rates, much better than other models, including a 5-layer CNN.
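Roughly, I read the pipeline as something like this (a sketch, not the authors' code; the feature matrices, sizes, and random data are all placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    # Placeholder feature matrices: the two sets of metrics per face image
    # (names, dimensions, and the random data here are assumptions).
    n_faces = 2000
    shape_feats = np.random.randn(n_faces, 400)    # e.g. landmark positions
    appear_feats = np.random.randn(n_faces, 400)   # e.g. shape-free texture

    # Top 25 PCs from each set, concatenated into a 50-d "face space".
    face_space = np.hstack([
        PCA(n_components=25).fit_transform(shape_feats),
        PCA(n_components=25).fit_transform(appear_feats),
    ])

    # Linear model from face-space coordinates to one cell's firing rate;
    # the score is the fraction of variance explained (R^2).
    rates = np.random.randn(n_faces)               # placeholder responses
    print(LinearRegression().fit(face_space, rates).score(face_space, rates))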

This is pretty cool. I'd like to see a follow-up where the chosen dimensions were further refined using something a bit more iterative than an arbitrary PCA cutoff.

Also I really want to know what eye motion was present during each trial. This paper presents a very "instantaneous" recognition perspective and doesn't talk about integration over time or the impact of sequential perception of face components on recognition. (E.g., an upside-down face is hard to recognize because your gaze has to move up from the eyes to see the mouth, which is a sequence rarely encountered in the real world.)



> Also I really want to know what eye motion was present during each trial.

Incredibly important point. Looking at the physiology of the eye and the early visual system all together, it's not clear that we can see anything without motion: object motion or saccades (saccades appear to be fundamental). Start with Lettvin's famous frog's eye paper.

More interestingly, it does appear that recognition of a 2D image may be a complex, abstract, learned behavior, while 3D recognition is at its core innate. And that writing (well, drawing) preceded reading.


See the methods: as usual for visual fixation experiments with monkeys, the subjects were trained to maintain fixation, with the fixation window being only a part of the image size:

> stimulus size spanned 5.7 degrees. The fixation spot size was 0.2 degrees in diameter and the fixation window was a square with the diameter of 2.5 degrees.

2.5 degrees is relatively large in my experience, which might be due to lots of noise in the signal, but anyway there won't be true saccades (well, there shouldn't be: parts of the recorded data where gaze saccaded outside of that fixation window should get rejected). Microsaccades were probably present, but those aren't mentioned.
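The rejection is typically something like this (a sketch with made-up eye traces; the real pipeline depends on the eye tracker and task code):

    import numpy as np

    # Placeholder eye traces: trials x timepoints x (x, y), in degrees,
    # with the fixation spot at the origin.
    gaze = np.random.randn(100, 500, 2) * 0.3

    # 2.5-degree square fixation window centered on the fixation spot.
    half_window = 2.5 / 2

    # Keep only trials where gaze stays inside the window the whole time;
    # anything that leaves it (a true saccade) gets rejected.
    inside = np.all(np.abs(gaze) <= half_window, axis=(1, 2))
    clean_trials = gaze[inside]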


A 5 layer CNN is absurdly shallow, so this isn't particularly surprising. I routinely work with 150+ layer CNNs - that's fairly standard practice if you want high-quality results.


This comment is absurd. "Your CNN sucks, I know this because I work with better ones all the time. (insert metric that doesn't mean much)"

Oh, BTW I do ML consulting lol


It's just not a fair comparison. No reason to get in a tiff.


Please don't be a jerk on HN. 5 layer CNNs are not best practice anymore, and it's not a fair comparison. Your flippant attitude is unwelcome.


VGG has 16-19 layers, Inception has ~50 layers, and ResNet goes up to 152 layers, all of which were state of the art at some point over the last ~2-3 years. A more faithful comparison with CNNs would've used one of these models pre-trained on a much larger face dataset.


I don't think it's anywhere near conclusive yet that more layers = better. It's pretty telling that the current state of the art is combining a bunch of layers together in a pseudo-random fashion. Nobody understands how these things work to the point that we can write down a formula or equation to produce better CNNs, or even predict with any accuracy which models will be more effective. You think more layers are better because the best models we have happen to have the most layers? Some deep understanding of concepts there.


I don't think I or the parent comment are necessarily suggesting that more layers are better; we're pointing out that using only 5 layers suggests they're not using a state-of-the-art architecture. You can't faithfully say "oh, a CNN cannot model this relationship" when the evaluation wasn't thorough. (Especially given that they don't mention modern face recognition systems like DeepFace or FaceNet; I'd be interested to see whether the embeddings those produce correlate with the responses, if a simple PCA model works so well.)

Also, don't be so dismissive: we have a strong enough empirical and intuitive understanding of CNNs that we're able to make thoughtful improvements over time. In fact, the insight behind the ResNet paper was noticing that naively adding layers doesn't improve performance, and that training error actually degrades as layers are added. The solution was to construct the network so that each block learns a residual mapping that only modifies its input rather than completely transforming it. The whole point of that paper was solving this degradation problem so they could use a ridiculously deep architecture, like a 152-layer network, to get better results.
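Concretely, the residual trick is just a skip connection that adds the input back to the block's output (a minimal sketch, not the exact ResNet block, which also has batch norm):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            # The block learns a residual F(x); adding x back means it
            # only has to model the change to its input, so a block
            # that learns F(x) = 0 is a harmless identity.
            return self.relu(x + self.conv2(self.relu(self.conv1(x))))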


> the current state of the art is combining a bunch of layers together in a pseudo-random fashion

Please don't say things like this. Neural network design is nothing like a random process: people understand very well when to use what kind of layer, and when to add more layers.

It's generally pretty well accepted that more layers = better, because depth gives the network a better ability to model complex, hierarchical relationships between features.

There are two reasons not to add layers:

1) If you are overfitting. In visual tasks this is fairly rare, and there are other, better ways to combat it.

2) It's harder to train. Until ResNet came out, training 100+ layers was considered unapproachable. Even now, more layers make training much harder.


They only generated 2000 images to work with, so I'm not sure any CNN can be expected to do terribly well.


True - it would be interesting to use, say, the coding layer of an autoencoder trained on one of the many face datasets, and then train only the last few layers to fit their data.
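Something like this, where everything (the encoder shape, the 128-d code, the data) is a made-up stand-in for the encoder half of an autoencoder pretrained on a face dataset:

    import torch
    import torch.nn as nn

    # Stand-in for a pretrained face autoencoder's encoder (e.g. one
    # trained on CelebA); the shapes here assume 64x64 RGB inputs.
    encoder = nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        nn.Flatten(),
        nn.Linear(64 * 16 * 16, 128),                          # 128-d code
    )
    for p in encoder.parameters():
        p.requires_grad = False     # freeze the pretrained weights

    # Train only this small head to map the code to a cell's firing rate.
    head = nn.Linear(128, 1)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)

    imgs = torch.randn(8, 3, 64, 64)   # placeholder face images
    rates = torch.randn(8, 1)          # placeholder measured responses
    loss = nn.functional.mse_loss(head(encoder(imgs)), rates)
    loss.backward()
    opt.step()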


5 layers is AlexNet territory (AlexNet has 5 conv layers plus 3 fully connected), which is a decent starting point for visual tasks.


Since when are 150+ layer CNNs fairly standard?

Inception-v3 has about 50 layers, and that's considered a lot (requires considerable processing power to train).


I would say since ResNet beat the state of the art on ImageNet in 2015. It's not standard, but that's what the state of the art can take.



