I don't quite remember whether it was first used in the ViT paper[1], but it's a fairly straightforward idea. You treat the patches of an image like words in a sentence: reduce the size of each patch (num_of_pixels x num_of_pixels) with a linear projection so that it's actually tractable to process and the sparse pixel information gets compressed, add positional encodings so the location of each patch is preserved, and from that point on handle them with transformers the same way language models handle words. Essentially, words are a human-constructed but information-dense representation of language, whereas images are quite sparse, because changing an individual pixel value doesn't really change much about the image.
The flattened image patch of width and height PxP pixels gets multiplied with a learnable matrix of dimension P^2 x D (or P^2·C x D when the image has C channels), where D is the size of the patch embedding. In other words, it’s a linear transformation that maps each flattened patch to a D-dimensional embedding.
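To make the shapes concrete, here is a minimal sketch of this patch embedding step in PyTorch. The class name, the default sizes (224x224 images, 16x16 patches, D=768), and the choice of nn.Linear plus a learnable positional table are my own for illustration; the actual ViT implementation differs in details (it also prepends a class token, for example).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into PxP patches, flatten each one, and project it to D dims."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # Learnable projection: flattened patch (P*P*C) -> embedding (D)
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        # Learnable positional embeddings, one per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Cut into non-overlapping PxP patches and flatten each to a vector of length P*P*C
        patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)
        patches = patches.reshape(B, -1, C * P * P)   # (B, num_patches, P*P*C)
        # Linear projection plus positional information
        return self.proj(patches) + self.pos_embed    # (B, num_patches, D)


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

From here the patch tokens go into the transformer exactly like word embeddings would.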
[1]: https://arxiv.org/pdf/2010.11929.pdf