Hacker News
Meta AI: Fist high-performance self-supervised algorithm for multiple modalities (facebook.com)
89 points by leirbagarc on Jan 21, 2022 | 56 comments



Can someone explain this like I'm 5? What are the use cases when it says it works on images, text, etc.? Why is this a big deal? What's the human input here? And what output should we expect?

From what I understand, human validation (supervision) is not happening while the algorithm is training on the data. Is that right? Will this be open to the public via standard ML frameworks, or will it be proprietary?


I work on ML recommendation algos for a big Facebook-sized company, so perhaps I can give some insight on how this model could be used. This is an example and might not resemble anything they're doing internally. This is more ELI16:

Facebook marketplace:

- Seller posts a listing with image + description

- FB uses data2vec to transform the image + description into one single vector (e.g. you could average the two vectors)

- Buyer searches for a product using text search - the text is also encoded into a vector with data2vec

- To serve product search results, you find the closest match to the text query vector from your pool of product vectors

The pattern described above is generally called vector search and is very common for recommendation algorithms and much more. There's a shift towards algorithms like data2vec that can combine different types of data into one vector. The aim is that an image of a dog and the word "dog" would map to the same vector, meaning that the vector represents the concept of a dog, regardless of the input data type.
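
To make the retrieval step concrete, here's a rough numpy sketch of that "average the two vectors, then nearest-neighbour search" idea. The random vectors stand in for whatever embedding model you'd actually use (data2vec or otherwise), and a real system would use an ANN index rather than brute force:

    import numpy as np

    def listing_vector(image_vec: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
        # One vector per listing, e.g. a simple average of the two embeddings
        return (image_vec + text_vec) / 2.0

    def top_k(query_vec: np.ndarray, listing_vecs: np.ndarray, k: int = 10) -> np.ndarray:
        # Cosine similarity between the query vector and every listing vector
        sims = listing_vecs @ query_vec / (
            np.linalg.norm(listing_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
        return np.argsort(-sims)[:k]  # indices of the k closest listings

    # Toy usage: random vectors stand in for real embeddings
    rng = np.random.default_rng(0)
    listings = np.stack([listing_vector(rng.normal(size=128), rng.normal(size=128))
                         for _ in range(1000)])
    print(top_k(rng.normal(size=128), listings, k=5))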

The advantage of these "multi-modal" algorithms (i.e. they can take multiple data types as input) is that you can (in theory) use them across all of your ML algorithm needs. If you're Facebook, you have 100s of teams and services that have this need. A few examples:

- Instagram ads prioritization

- Instagram search

- Harmful content moderation

- Facebook content search

- Facebook marketplace search

Each of these is likely a separate team, very likely using a separate embedding algorithm. As approaches like data2vec improve, there will be some consolidation.

N.B. - I've made a lot of assumptions based on what I've seen at my current employer. If anyone from Meta/Facebook is reading this, please chime in!


I think your understanding of this method is wrong. This work is about unifying the training objective across modalities, not training on multiple modalities simultaneously. A single data2vec model is not meant to take different types of data as input, at least as far as I understood it.

Directly from the paper: "Our work does not perform multimodal training but aims to unify the learning objective for self-supervised learning in different modalities"


That won't really help them -- their search UI, especially in the marketplace, is so horrific that a better model won't improve anyone's experience there.


Classic tech company logic -- spend hundreds of thousands of dollars on machines, researchers, and implementers to create a new machine learning model to "improve an experience" on their site. Then stuff it with ads, which were the real obstacle to the experience anyway.


A very very profitable, high margin obstacle to the experience.


Yes! Definitely like this analysis :)


Very cool!


Validation is different from supervision. Supervision happens during training. Typically, when we train a speech model, we prepare a speech segment and its transcript, and that transcript is generated manually by humans. For language models such as BERT or GPT-3, you don't really need human-generated labels: what they do is mask out some text, and the masked-out text is the label to be predicted (self-supervision), so the label is already in the data. What this model does is train on images, text and speech in this same way, and when you combine these kinds of data, they complement each other. Think about what you can do with a language model like GPT-3 now; you would be able to ask the model to do something with text, speech and images. The only thing that worries me is that Facebook presumably trained on lots of data, and the model might be used to extract sensitive data.
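
To make the "label is already in the data" part concrete, here's a toy sketch of how a masked-language-model training example gets built (the general BERT-style idea, not Facebook's actual code):

    import random

    def make_mlm_example(tokens, mask_rate=0.15, mask_token="[MASK]"):
        # Hide ~15% of positions and keep the originals as prediction targets.
        inputs, labels = list(tokens), {}
        for i in range(len(tokens)):
            if random.random() < mask_rate:
                labels[i] = tokens[i]   # the "label" is just the hidden token
                inputs[i] = mask_token
        return inputs, labels

    print(make_mlm_example("the quick brown fox jumps over the lazy dog".split()))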

The multi-modal part is a big deal because we could have a model that understands all the inputs a human needs.


The benefits are similar to when you consolidate tasks with a lot of similarities. Advancements in one area can be immediately applied to another area. You can consolidate your effort into a single point. Training a network is increasingly cost-prohibitive, running into the millions of dollars, and this would allow you to consolidate those expenses. EDIT: Looks like they still need to train for each modality, so no benefit from that here.

I'd imagine on the flip side the potential problems would be when gains in one modality are offset by a decrease in performance in another or you are prevented from trying new things because of the choices made to support a unified architecture.


It's already public in their repository, fairseq.


Except not the vision part, which is what they were discussing. Also, I don't see any way to run the examples as the code is missing those files: https://github.com/pytorch/fairseq/tree/main/examples/data2v...

The code is what we're interested in here, not "hold onto your papers" talk that tells us how to be excited about it.


I imagine a multimodal platform like Facebook (with people using photos and text together in some way) would find a single model valuable.


I don't believe it's actually the first, but this is pretty awesome. Too bad it's Facebook :/


I think the big achievement is how it surpassed the previous models for each individual modality in performance.


That's not true for NLU, at least. It is on par with 2019's RoBERTa on GLUE, and many larger, more advanced language models have come out since.

It is still great work though, a robust masking representation architecture that works across modalities.


They mention that in the article:

> We apply data2vec separately to speech, images and text and it outperformed the previous best single-purpose algorithms for computer vision and speech and it is competitive on NLP tasks.


Basically they cut out a part of the input and make the network predict the missing part. (edit: they actually predict the average of all features). This works for images, audio, and text. It produces high-quality feature representations of the data, which can be used to build specialised networks on top of. The two main tricks are:

1. Do the cutout in feature space, not the original input space. (edit: cutout is actually in input space)

2. The above would likely just collapse the features to 0, so they use the same network that does the reconstruction to produce the features (!). In their own words:

"We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode)"


I believe that your interpretation is not correct. Based on my brief reading of the paper, the model contains 1) some known architecture for embedding the modality, and 2) the feature reconstruction transformer network, the two being trained at the same time.

If I am not mistaken, the masking occurs in the input modality, not the feature-space, even though it is the feature-space that is used for the reconstruction task.

Regarding how the feature space is kept uncollapsed, it seems like a hyperparameter-tweaking (ie unsolved?) problem; quoting the paper:

" Representation collapse.

A common issue with algorithms which create and predict their own targets is representation collapse. This occurs when the model produces very similar representations for all masked segments making the problem trivial to solve. Different strategies have been proposed to address this issue, e.g., contrastive models such as wav2vec 2.0 (Baevski et al., 2020b) use the same target representation both as a positive and a negative example, preventing collapse. Algorithms such as BYOL (Grill et al., 2020) do not optimize the teacher parameters to minimize the loss. VicReg (Bardes et al., 2021) adds an explicit loss encouraging variance among different representations.

In our experiments we found that collapse is most likely to happen in the following scenarios:

First, the learning rate is too large or the learning rate warmup is too short which can often be solved by tuning the respective hyper-parameters.

Second, the EMA decay rate is too low which leads to student model collapse which is propagated to the teacher due to parameter tracking. This can be addressed by carefully tuning τ0, τe and τn.

Third, we found collapse to be more likely for modalities where adjacent targets are very correlated and where longer spans need to be masked, such as for speech. We address this by either explicitly penalizing the lack of variance (Bardes et al., 2021), or by promoting variance through normalizing target representations over the current sequence or batch (Grill et al., 2020). The former worked well for small models but is less reliable for larger models and it also requires tuning additional hyper-parameters. In contrast, we found applying instance or batch normalization before or after averaging targets to work well while being simpler. For models where targets are less correlated such as for vision and NLP, momentum tracking is sufficient to prevent representation collapse. "
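
For the third point, the normalization fix is simple in code. A hedged sketch (my reading, not their actual implementation) of instance-normalizing the teacher targets over the time dimension before the regression loss:

    import torch

    def normalize_targets(targets: torch.Tensor) -> torch.Tensor:
        # targets: (batch, time, dim) teacher representations.
        # Instance norm over the time axis keeps per-dimension variance ~1,
        # so "predict the same vector everywhere" stops being a cheap solution.
        mean = targets.mean(dim=1, keepdim=True)
        var = targets.var(dim=1, keepdim=True, unbiased=False)
        return (targets - mean) / torch.sqrt(var + 1e-5)

    t = torch.randn(4, 50, 768)
    print(normalize_targets(t).std(dim=1).mean())   # ~1.0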


Are there papers that show similar results on varied structured/relational/graphed data modalities?

Even with a large/labeled/cleaned dataset, it seems that each domain change, or even a change in formatting/encoding, forces you to change the architecture.


You're right.

That seems to explain why representations don't collapse into a constant, but not why they don't collapse to the same feature...


That's not quite how I understood it

> Do the cutout in feature space, not the original input space.

I think they do the cutout in the original input space, based on examples they show, e.g. they mask parts of text and grey out parts of an image.

> and make the network predict the missing part

I think they predict the latent representation (as you say in (2)), though not of the missing part but of the whole corrupted sample, and require it to be close to the latent representation of the original input produced by the teacher.


So they’re doing both masking for the inputs and knowledge distillation? Is it the combination of these two methods that’s novel?


You're right, they cut in the original space and average-pool the features before an L1/L2 loss
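
For reference, the target construction is then roughly this (my sketch; the paper averages the teacher's top-K block outputs and uses a Smooth L1 loss, which interpolates between L1 and L2):

    import torch

    def build_targets(layer_outputs, k=8):
        # Average the teacher's top-k layer outputs to get the regression target
        return torch.stack(layer_outputs[-k:]).mean(dim=0)

    layers = [torch.randn(2, 50, 768) for _ in range(12)]   # toy hidden states
    target = build_targets(layers)
    pred = torch.randn_like(target)                         # stand-in for the student output
    loss = torch.nn.functional.smooth_l1_loss(pred, target)
    print(loss.item())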


dang/mods: The title here has a small typo: it misspells algorithm.


Before doing that, it also misspells "first".


Also "algorithn".


Guys, sorry for the typos. The original title was too long. I needed to replace "First" with "1st" (HN switched it back automatically, lol) and tried to abbreviate.


Still doesn't explain "algorithn".


"n" is more narrow than "m", so the title would fit HN's title length limits.


N is right next to M on QWERTY keyboards. Typing fast, you may hit one instead of the other.


Why would they type it manually at all? My understanding is that they copy-pasted the title, were told it was too long, and tried to shorten "First" to "1st", which HN automatically expanded back, upon which they removed some words instead. So how did "algorithm" become "algorithn"?

This is not some random nitpicking. This is a great mystery worthy of its own detective TV show so I don't appreciate the downvotes. All my friends are extremely puzzled by this whole situation.


Kiro, perhaps I'm a bot from FB that posts our scientific achievements on HN. FB needs to improve my algorithn!


A subtle nod to the fact this was posted on hn


At first I thought it was about an algorithm to generate text with mistakes on purpose - and that it did a nice job.


It seems that they pass everything through an autoencoder first, and a different network tries to predict, from a partially masked input, the "correct" autoencoder latent-space representation of the unmasked input. If it works, the decoder of the autoencoder can generate (guess) the unmasked data from the latent space.


Is there progress on general structured/relational/graphed data modalities?

In practice, you spend time and expertise to reshape the data into a previously-known-to-work form.

FWIW, our datasets are huge, with dense data/noise ratio.


Can you clarify a little bit on what you mean by your question?

One area of research is extracting an effective type theory from data that is viewed as a semantic model, e.g. sensor data of a phenomenon would lead to a type theory describing it.

You’re essentially taking a TDA persistent homology/covering and reinterpreting that through the lens of homotopy type theory to “decompile” your data. There are some early results, e.g. connecting convolutions and type division.

But that’s overall at really early stages of research.


> effective type theory from data that is viewed as a semantic model,

Well, much simpler stuff.

You have a logistic dataset of objects/GPS/time. Fine. Now you add historic truck data, which is a location time series. It is not obvious how you can learn the delivery times between 2 items.

You need human expertise to design a way to extract usable routes, and also to solve multi-hop routes; then maybe you can learn the typical speed between 2 items.

It is doable, but not with a generic "multi-modal architecture".


Naively, it sounds like the same problem — you’d want some kind of sheaf on your data to evaluate rules.

Detecting routes sounds like a persistent homology problem — and then you’re detecting flow across that inferred structure.

I think Michael Robinson has work in that area:

https://www.youtube.com/watch?v=b1Wu8kTngoE


Any papers cover this?


None published; I’m working on a white paper about type division in the abstract case this quarter, once I finish up the second one on shape algebras.

Type division is like convolution from ML, which is why we can recognize “shapes of shapes” and undo the product structure. (And arguably, another avenue towards arriving at the manifold hypothesis.)

I’m currently working on the “easy” direction of encoding the type statements to matrices, with the hope most steps are reversible. (So far, so good.)

Still rough white paper:

https://www.zmgsabstract.com/whitepapers/shapes-as-digital-i...

GitHub repo for encoding type theory models, still VERY early:

https://github.com/zmgsabstract/mathengine


Crazy, and people think AI isn't moving forward anymore..


Crazy, but who thinks AI is not moving forward? I don't think anyone on HN does.



Sounds like the limit is computation speed, not conceptual. That should solve itself over the years.


Many people on HN think that, for instance, GPT-3 and that family of work does not represent any real advancement, and they continually disparage and poke holes in the model's output. These threads occasionally get upvoted to the front page.


Lots of people think it's essentially a dead end as it's missing some kind of 'secret sauce' that the brain has and which we are just too dumb to figure out.


It's a lack of appreciation for how complicated the brain really is. The "secret sauce" is "it's really frickin' complex and we don't understand a fraction of it". It's like comparing an abacus to the latest AMD or Intel CPU and saying, "The abacus is a dead end because it's obviously missing that 'secret sauce' that this magical computing oracle has."


I don't understand how this is different from BYOL? I'd appreciate it if someone could give a small explanation.


> Similar to our work, both BYOL (Grill et al., 2020) and DINO (Caron et al., 2021) regress neural network representations of a momentum encoder, but our work differs in that it uses a masked prediction task and we regress multiple neural network layer representations instead of just the top layer which we find to be more effective. Moreover, we demonstrate that our approach works for multiple modalities.

From the related works section of the paper.



Do they need a pre-trained modality-specific model for each modality to train this?


No, they don't need a pre-trained model; they start from scratch. Also, it's only the approach that is common between modalities; they train them separately.


"Our work does not perform multimodal training but aims to unifiy (sic) the learning objective for self-supervised learning in different modalities."


Now do it with matrices GPT style

Sheeeeeeeeeeeeeeeeeeit



