Agreed. ViTs are the better pick if you're looking to go multimodal or want attention-specific mechanisms such as cross-attention. If not, there's evidence out there that ViTs aren't better than convnets, either at small model sizes or at scale (https://frankzliu.com/blog/vision-transformers-are-overrated).
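To make the multimodal point concrete, here's a minimal sketch (not from the linked post; dims and shapes are made up for illustration) of why ViTs slot in so easily: the image is already a sequence of patch tokens, so text tokens can cross-attend to it directly with no extra plumbing.

```python
# Hypothetical cross-attention sketch: text queries attend over ViT patch tokens.
import torch
import torch.nn.functional as F

d = 64                                  # shared embedding dim (assumed)
img_tokens = torch.randn(1, 196, d)     # 14x14 grid of ViT patch embeddings
txt_tokens = torch.randn(1, 12, d)      # 12 text tokens

Wq, Wk, Wv = (torch.nn.Linear(d, d) for _ in range(3))

q = Wq(txt_tokens)                      # queries come from the text
k, v = Wk(img_tokens), Wv(img_tokens)   # keys/values come from the image patches

attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)  # (1, 12, 196)
fused = attn @ v                        # each text token is a mix of image patches
print(fused.shape)                      # torch.Size([1, 12, 64])
```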
ViTs have also proven more effective for zero-shot generalization (e.g., CLIP-style image-text pretraining), since self-attention captures global context and long-range relationships across the input, which CNNs' local receptive fields struggle to model.
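A quick sketch of the global-context point (again just illustrative numbers): in a single self-attention layer every patch attends to every other patch, whereas a 3x3 conv only mixes a patch with its immediate neighbors.

```python
import torch
import torch.nn.functional as F

d, n = 64, 196                          # embed dim, 14x14 patches (assumed)
patches = torch.randn(1, n, d)

qkv = torch.nn.Linear(d, 3 * d)
q, k, v = qkv(patches).chunk(3, dim=-1)

attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
print(attn.shape)                       # (1, 196, 196): every patch sees every patch

# A conv layer, by contrast, has a fixed local receptive field per output position.
conv = torch.nn.Conv2d(d, d, kernel_size=3, padding=1)
grid = patches.transpose(1, 2).reshape(1, d, 14, 14)
print(conv(grid).shape)                 # (1, 64, 14, 14): each output mixes a 3x3 window
```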