I don't quite remember whether it was first used in the ViT paper[1], but it's a fairly straightforward idea. You treat the patches of an image like words in a sentence: reduce the size of each patch (num_of_pixels x num_of_pixels) with a linear projection so that it's actually tractable to process and the sparse pixel information gets compressed, add positional encodings so the location of each patch is preserved, and from that point on handle them with transformers the same way language models handle words. Essentially, words are a human-constructed but information-dense representation of language, whereas images are quite sparse, because changing an individual pixel value doesn't really change much about the image.
The flattened image patch of width and height PxP pixels gets multiplied with a learnable matrix of dimension P^2 x D (or P^2·C x D when the image has C channels), where D is the size of the patch embedding. In other words, it’s a linear transformation that maps each flattened patch to a D-dimensional embedding.
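To make the shapes concrete, here is a minimal sketch of this patch embedding step in PyTorch. The class name, the default sizes (224x224 images, 16x16 patches, D=768), and the choice of nn.Linear plus a learnable positional table are my own for illustration; the actual ViT implementation differs in details (it also prepends a class token, for example).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into PxP patches, flatten each one, and project it to D dims."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # Learnable projection: flattened patch (P*P*C) -> embedding (D)
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        # Learnable positional embeddings, one per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Cut into non-overlapping PxP patches and flatten each to a vector of length P*P*C
        patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)
        patches = patches.reshape(B, -1, C * P * P)   # (B, num_patches, P*P*C)
        # Linear projection plus positional information
        return self.proj(patches) + self.pos_embed    # (B, num_patches, D)


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

From here the patch tokens go into the transformer exactly like word embeddings would.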
[1]: https://arxiv.org/pdf/2010.11929.pdf