Like some of the other ML/AI posts that made it to the top page today, this research too does not give any clear way to reproduce the results. I looked through the pre-print page as well as the full manuscript itself.
Without reproducibility and transparency in the code and data, the impact of this research is ultimately limited. No one else can recreate, iterate, and refine the results, nor can anyone rigorously evaluate the methodology used (besides giving a guess after reading a manuscript).
It's 2019; many are finally realizing it's time to back up your results with code, data, and some kind of specification of the computing environment you're using. Science is about sharing your work for others in the research community to build upon. Leave the manuscript as the pretty formality.
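Even something as lightweight as dumping library and driver versions next to the results would go a long way. A minimal sketch, assuming a PyTorch stack (the paper doesn't say what they actually used):

    import json, platform, sys

    import torch

    # Record the environment alongside the results so a run can at least be
    # approximately re-created later.
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

    with open("environment.json", "w") as f:
        json.dump(env, f, indent=2)

Pair that with a pinned requirements file and the exact commit hash of the training code, and most of the "what did you actually run" questions go away.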
Given that it's evolved, I'd imagine this is a given? Or, more accurately, you could probably duplicate some kind of emergent behaviour, but it would be different given different randomized parameters.
I think the point is more that they don't do any meta-analysis of the big changes that were seen across many of the trials. They don't try to isolate, for example, specific mechanisms that formed in a majority of trials that almost made it to this stage. They just don't go into any analysis of the failure trees in the trial dataset at all.
IMHO this is probably just a case of them trying to stretch this out across a bunch of different papers, and this is just the announce paper. Which is a shitty practice, but the current academic environment encourages taking good findings and puffing them up into multiple incomplete papers rather than one well-done paper.
Glancing through the paper, it seems like they use the recent Transformer model. Does whatever underlying stack they use expose a way to share RNG seeds and the exact hardware optimizations your environment applies during training? Otherwise "publishing the seed" sounds nice but might not be as trivial as the phrase suggests.
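For what it's worth, if the stack is PyTorch (an assumption on my part; the paper doesn't say), pinning things down takes more than one seed call, but it isn't a huge amount of code either:

    import random

    import numpy as np
    import torch

    SEED = 42

    # Seed every RNG that touches training: Python, NumPy, CPU torch, all GPUs.
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # The cuDNN autotuner picks different kernels per machine/run, and some
    # kernels are only deterministic if you accept a speed penalty.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Even then, multi-GPU runs and data-loader workers can reintroduce non-determinism, which is exactly why "publish the seed" undersells the problem.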
reproducibility should be something that's baked into an experiment's design.
so, if their experiment was designed such that reproduction is inherently difficult, they should have designed it in a better way, and they should've used a toolset that wouldn't run into that problem.
a non-reproducible experiment isn't necessarily completely without value, but it's a thing that everyone should look askance at till it proves its worth.
(apologies if my comments don't apply to this experiment and if it is reproducible -- i didn't have time to read through the OP, but i thought this reply was still a worthwhile response to its specific parent comment)
No, that's absolutely a fair point; my comment was aimed more at the RNG aspect. I have not looked into this specific one either, but normally people would hopefully not publish their best randomly achieved run if the system cannot reproduce it or similar results.
That being said the paper in question doesn't seem to reference open source code anyway so I guess my point was kind of moot, apologies.
There are specific CUDA operations which are not guaranteed to be reproducible, though, as well as some cuDNN operations which are only deterministic at a performance sacrifice, and this does cause real problems.
You want to be able to set the seed, if only to be able to debug your program. Pseudo-random is sufficient for these models and is independent of any hardware settings. You should not share your random source between concurrent threads, though, but that's good practice anyway.
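On the last point, a small sketch of what I mean, using NumPy's SeedSequence to give each worker its own reproducible stream instead of a shared global RNG:

    from numpy.random import SeedSequence, default_rng

    # One parent seed, spawned into independent child streams (one per
    # worker/thread) so results don't depend on thread scheduling.
    ss = SeedSequence(42)
    child_seeds = ss.spawn(4)
    rngs = [default_rng(s) for s in child_seeds]

    print([rng.standard_normal() for rng in rngs])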
Most machine learning accelerators have a few non-deterministic operations. The chances that you could run trillions of floating point operations through a GPU and get a bit-for-bit identical result are low.
Some operations split and join data in non-deterministic ways (especially the order of operations, leading to different floating point rounding). If you shard across multiple machines, weight accumulation order will depend on network latency, for example.
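A toy illustration of the rounding point: floating point addition isn't associative, so anything that accumulates values in a latency-dependent order can legitimately produce different bits.

    import random

    # Same numbers, two summation orders: the results usually differ in the
    # low bits because float addition is not associative.
    vals = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(100_000)]

    a = sum(vals)
    random.shuffle(vals)
    b = sum(vals)

    print(a == b)      # usually False
    print(abs(a - b))  # tiny relative to a, but not zero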
Also, GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.
> ...split and join data in non-deterministic ways ... to different floating point rounding
Ah, of course! A very timely reminder, thanks!
> GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.
Now that's worrying. A bit flip can't be expected to be skewed towards any particular bit within a float, so it could easily happen in the exponent, skewing a single value by orders of magnitude one way or the other. Combine that with the rest of your 'good' results and yuck. That's very concerning. Thanks for the warning.
> A bit flip can't be expected to be skewed towards any particular bit within a float,
Actually, I think they are: for example, the exponent path through an adder/multiplier is typically shorter, so when operated close to clock speed limits, the exponent is more likely to be correct.
(I've not actually verified the above on real hardware)
I find this paper to be so steeped in hype and dogma as to be nearly incomprehensible.
Which is a shame, because it's a reasonable approach. I just wish they'd frickin' described what they did instead of spending the whole paper monologuing and showcasing unconvincing experiments. No need to justify what you're doing, just do it.
Fergus Lab at NYU. I believe he's across the hall from Yann LeCun as well ;)
Still a long way from a Theory of Biogenesis. But a good next step is using a differentiable model to predict novel proteins which have no analogue in Nature. Much like Materials Genome researchers searching for stable phases of matter!
"Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet
> "Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet
The Transformer layer is a radical leap over LSTMs and CNNs. While LSTMs can model sequences and CNNs regular grids, neither has an efficient long-range interaction mechanism. The Transformer does. It's a huge leap, similar to the one in computer vision from a few years ago.
What is needed, besides spatial translation invariance (CNN) and temporal invariance (LSTM), is permutation invariance. Whenever the problem can be described as a graph, the ordering of the vertices and edges should not matter. You can't do that with CNNs and LSTMs, but you can do it with graph neural nets and Transformers.
Apparently Transformers are the best for language modelling (GPT-2), playing games (Dota 2 from OpenAI), composing music, and possibly now modelling proteins. I assume they will play a huge role in working with graph-structured data, with multiple entities and relations.
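To make the "every position interacts with every other position" point concrete, here's a bare-bones, single-head scaled dot-product self-attention sketch in NumPy (no projections, masking, or positional encodings; without positional encodings the output simply permutes with the input):

    import numpy as np

    def self_attention(x):
        """x: (seq_len, d) array; returns (seq_len, d) of attended outputs."""
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)          # pairwise scores, any distance apart
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ x                     # each position mixes in every other

    out = self_attention(np.random.randn(10, 16))
    print(out.shape)  # (10, 16)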
Transformers work well on sequence tasks because they not only compare well in terms of accuracy but also scale better than RNNs like LSTMs or GRUs. That means they can be trained on more data.
This isn't really the same as CNNs, which model images by operating at different scales. I'm not aware of any cases of Transformers being used particularly successfully on images.
They can be used on graphs, of course, by translating the problem into a graph-walk problem (à la DeepWalk); see the sketch below.
All the examples you gave (language modelling, Dota 2, music and protein modelling) are set up as sequence prediction problems, so they are perfect for Transformers.
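On the DeepWalk point, the rough idea is to sample random walks over the graph and feed the resulting node sequences to a sequence model. A toy sketch:

    import random

    # Toy adjacency list; real use would be a large graph.
    graph = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b", "d"],
        "d": ["c"],
    }

    def random_walk(graph, start, length):
        walk = [start]
        for _ in range(length - 1):
            walk.append(random.choice(graph[walk[-1]]))
        return walk

    # Each walk is a "sentence" of node IDs that a sequence model can consume.
    walks = [random_walk(graph, node, 8) for node in graph for _ in range(10)]
    print(walks[0])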
But I'd note that it is built on top of a CNN base (ResNet or RetinaNet) and that the Attention-only system performed slightly worse than the one including the CNN layers.
Also, this isn't really a Transformer architecture, even though it uses Attention.
But maybe this is too much nitpicking? I agree that Attention is a useful primitive - my point is that the Transformer architecture is too specific.
(Also, this is a really nice paper in that it lays out the hyperparameters and training schedules they used. And that Appendix is amazing!)
Yann LeCun did not; otherwise he'd be a coauthor. As it is, this was a collaboration between NYU and Facebook AI Research, with multiple authors working at both institutions.
My understanding is that academic authorship credit is political: authors don’t always contribute, and contributors don’t always get credit. Is this not the case?
> It does a surprisingly good job of predicting protein function across a diverse set of tasks, including ones structural in nature, like the induction of a single neuron that is able, with some degree of accuracy (ρ = 0.33) to distinguish between α helices and β strands (I suspect the network as a whole is far more performant at this task than the single neuron we’ve identified, but we didn’t push this aspect of the analysis as the problem is well tackled using specialized approaches.)
I hate to be that guy, but distinguishing between alpha helices and beta strands is not really that hard.
It's a good start though. I would propose the following test: let's see if we can use the activations from the neurons to predict the luminosity of a 'base' GFP molecule (under a fixed set of experimental conditions). Train it on 10,000 mutations (this could maybe be done in very high throughput by tethering the XNA to a bead, synthesizing, and then measuring the beads one by one), and see if it can extrapolate the effects of 10k more, or heck, just do it brute force, we've got high-throughput robots, right?
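Roughly what I have in mind, as a sketch. Everything here is hypothetical: embed(seq) stands in for whatever per-sequence representation the paper's network produces, and the brightness labels are placeholders for the bead measurements.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    def embed(seq):
        # Placeholder featurization; in the real test this would be the
        # network's activations for the mutant sequence.
        rng = np.random.default_rng(abs(hash(seq)) % (2 ** 32))
        return rng.standard_normal(64)

    # Placeholder data: 10k mutants to train on, 10k to extrapolate to.
    sequences = [f"gfp_mutant_{i}" for i in range(20_000)]
    brightness = np.random.rand(len(sequences))   # stand-in for measured luminosity

    X = np.stack([embed(s) for s in sequences])
    X_tr, X_te, y_tr, y_te = train_test_split(X, brightness, test_size=0.5, random_state=0)

    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print("held-out R^2:", r2_score(y_te, model.predict(X_te)))

With random placeholder labels the R^2 will sit near zero, of course; the interesting question is what it does with real measurements.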
And predicting protein function is not that hard either. The ground-truth labels are often determined by sequence alignment similarity, not by experiment. So the results are far from profound.
Doing it right is quite hard.
Doing it usefully is even harder [1].
Getting a good training set without too many biases is the really hard part.
Generating a ground truth that is actually a truth is very expensive.
I have to read the paper carefully again. But for the contact prediction I think the training set will cover most of the data used in the validation, due to the way PDB "sequences" are distributed over UniParc, as well as how PDB 3D structures are generated experimentally. i.e. there are 120,000 PDB-related sequences in UniParc, but they cover only 45,000 entries in UniProtKB, because PDB-derived sequences are rarely full length, often mutated, and highly duplicative in coverage.
[1] Predicting the root GO terms will give you an insane TP/FP rate but is completely useless.
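This kind of leakage is at least cheap to check for crudely. A toy sketch using k-mer Jaccard similarity as a stand-in for proper identity clustering (CD-HIT or MMseqs2 would be the real tools; the sequences below are made up):

    def kmers(seq, k=5):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Made-up toy sequences; the validation one is a single mutation away
    # from a training sequence.
    train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
    valid_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"]

    train_kmers = [kmers(s) for s in train_seqs]
    for v in valid_seqs:
        vk = kmers(v)
        best = max(jaccard(vk, tk) for tk in train_kmers)
        if best > 0.5:
            print(f"possible train/validation overlap (k-mer similarity {best:.2f})")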
This is cool, but would be significantly cooler if they did some kind of biological follow-up. Perhaps getting their model to output an "ideal" sequence for a desired enzymatic function and then swapping that domain into an existing protein lacking the new function.
Bingo. That would be really interesting. And useful.
There are probably already enzymes in this data set that have measurements of their behavior. Could this modelling approach be coaxed to find the one with the highest processivity? Or do we need more labeled data?
I'm sure they have a bunch of enzymes in their dataset for which kinetic measurements have been published. Another interesting follow-up study would be attempting to improve kinetic behavior. They could, for instance, analyze some of the catalytically perfect enzymes out there (TIM, SOD, catalase, etc.) and see if the model could project improvements onto existing orthogonal protein classes.
Not in a structured way that is easily usable. Swiss-Prot has most of this data, but it is not quite normalized in units. If you did this annoying work I would like to talk to you so we can plug it into Swiss-Prot.
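For anyone tempted: most of the annoying work is exactly this sort of thing, sketched here with made-up values (the unit table and entry names are illustrative, not Swiss-Prot's actual format):

    # Convert kinetic constants reported in assorted units to a single unit.
    TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

    def km_in_molar(value, unit):
        """Km reported as (value, unit) -> molar."""
        return value * TO_MOLAR[unit]

    # Made-up entries, the kind of heterogeneity you'd scrape from papers.
    raw = [("enzymeA", 2.5, "mM"), ("enzymeB", 310, "uM"), ("enzymeC", 0.0007, "M")]
    normalized = [(name, km_in_molar(v, u)) for name, v, u in raw]
    print(normalized)  # [('enzymeA', 0.0025), ('enzymeB', 0.00031...), ('enzymeC', 0.0007)]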
The Attention Is All You Need paper is where Transformers were introduced:
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.