I work on ML recommendation algos for a big, Facebook-sized company, so perhaps I can give some insight into how this model could be used. This is just an example and might not resemble anything they're doing internally; it's more of an ELI16:
Facebook marketplace:
- Seller posts a listing with image + description
- FB uses data2vec to transform the image + description into one single vector (e.g. you could average the two vectors)
- Buyer searches for a product using text search - the text is also encoded into a vector with data2vec
- To serve product search results, you find the closest match to the text query vector from your pool of product vectors
The pattern described above is generally called vector search, and it's very common in recommendation systems and well beyond them. There's a shift toward algorithms like data2vec that can combine different types of data into one vector. The aim is for an image of a dog and the word "dog" to map to (roughly) the same vector, so that the vector represents the concept of a dog regardless of the input data type.
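Here's a minimal sketch of that flow, using NumPy and hypothetical `encode_image` / `encode_text` helpers as stand-ins for a real encoder (data2vec doesn't ship an API like this; the helpers below just return placeholder vectors so the example runs):

```python
import zlib
import numpy as np

EMB_DIM = 128

def _fake_embedding(seed_bytes: bytes) -> np.ndarray:
    # Placeholder for a real encoder: a deterministic, unit-length random vector.
    rng = np.random.default_rng(zlib.crc32(seed_bytes))
    v = rng.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    return _fake_embedding(text.encode("utf-8"))

def encode_image(image_bytes: bytes) -> np.ndarray:
    return _fake_embedding(image_bytes)

# Index listings: average the image vector and the description vector
# into a single product vector per listing.
listings = [
    {"id": 1, "image": b"<jpeg bytes>", "description": "red mountain bike, 29in wheels"},
    {"id": 2, "image": b"<jpeg bytes>", "description": "wooden dining table, seats six"},
]
index = np.stack([
    (encode_image(l["image"]) + encode_text(l["description"])) / 2
    for l in listings
])

# Serve a query: encode the buyer's text and return the listing whose
# product vector has the highest cosine similarity.
query = encode_text("mountain bike")
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
print(listings[int(np.argmax(scores))]["id"], scores)
```

In practice you'd use an approximate nearest-neighbor index (e.g. FAISS) rather than a brute-force dot product over every listing, but the shape of the problem is the same.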
The advantage of these "multi-modal" algorithms (i.e. ones that can take multiple data types as input) is that you can, in theory, use them across all of your ML needs. If you're Facebook, you have hundreds of teams and services with this need. A few examples:
- Instagram ads prioritization
- Instagram search
- Harmful content moderation
- Facebook content search
- Facebook marketplace search
Each of these is likely owned by a separate team, and very likely uses its own embedding model. As approaches like data2vec improve, expect some consolidation.
N.B. - I've made a lot of assumptions based on what I've seen at my current employer. If anyone from Meta/Facebook is reading this, please chime in!
I think your understanding of this method is off. This work is about unifying the training objective across modalities, not about training on multiple modalities simultaneously. A single data2vec model isn't meant to take different types of data as input, at least as far as I understand it.
Directly from the paper:
"Our work does not perform multimodal training but aims to unify the learning objective for self-supervised learning in different modalities"
Classic tech company logic -- spend hundreds of thousands of dollars on machines, researchers, and implementers to create a new machine learning model to "improve an experience" on their site. Then stuff it with ads, which were the real obstacle to the experience anyway.