Like some of the other ML/AI posts that made it to the top page today, this research too does not give any clear way to reproduce the results. I looked through the pre-print page as well as the full manuscript itself.
Without reproducibility and transparency in the code and data, the impact of this research is ultimately limited. No one else can recreate, iterate, and refine the results, nor can anyone rigorously evaluate the methodology used (besides giving a guess after reading a manuscript).
It's 2019; many are finally realizing it's time to back up your results with code, data, and some kind of specification of the computing environment you're using. Science is about sharing your work for others in the research community to build upon. Leave the manuscript as the pretty formality.
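Even something as lightweight as dumping library and driver versions next to the results would go a long way. A minimal sketch, assuming a PyTorch stack (the paper doesn't say what they actually used):

    import json, platform, sys

    import torch

    # Record the environment alongside the results so a run can at least be
    # approximately re-created later.
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

    with open("environment.json", "w") as f:
        json.dump(env, f, indent=2)

Pair that with a pinned requirements file and the exact commit hash of the training code, and most of the "what did you actually run" questions go away.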
Given that it's evolved, I'd imagine this is a given? Or, more accurately, you could probably duplicate some kind of emergent behaviour, but it would be different given different randomized parameters.
I think the point is more that they don't do any meta-analysis of the big changes that were seen across many of the trials. They don't try to isolate, for example, specific mechanisms that formed in a majority of trials that almost made it to this stage. They just don't go into any analysis of the failure trees in the trial dataset at all.
IMHO this is probably just a case of them trying to stretch this out across a bunch of different papers, and this is just the announce paper. Which is a shitty practice, but the current academic environment encourages taking good findings and puffing them up into multiple incomplete papers rather than one well-done paper.
Glancing through the paper, it seems like they use the recent Transformer model. Does whatever underlying stack they use expose a way to share RNG seeds and the exact hardware optimizations your environment applies during training? Otherwise "publishing the seed" sounds nice but might not be as trivial as the phrase suggests.
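For what it's worth, if the stack is PyTorch (an assumption on my part; the paper doesn't say), pinning things down takes more than one seed call, but it isn't a huge amount of code either:

    import random

    import numpy as np
    import torch

    SEED = 42

    # Seed every RNG that touches training: Python, NumPy, CPU torch, all GPUs.
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # The cuDNN autotuner picks different kernels per machine/run, and some
    # kernels are only deterministic if you accept a speed penalty.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Even then, multi-GPU runs and data-loader workers can reintroduce non-determinism, which is exactly why "publish the seed" undersells the problem.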
reproducibility should be something that's baked into an experiment's design.
so, if their experiment was designed such that reproduction is inherently difficult, they should have designed it in a better way, and they should've used a toolset that wouldn't run into that problem.
a non-reproducible experiment isn't necessarily completely without value, but it's a thing that everyone should look askance at till it proves its worth.
(apologies if my comments don't apply to this experiment and if it is reproducible -- i didn't have time to read through the OP, but i thought this reply was still a worthwhile response to its specific parent comment)
No, that's absolutely a fair point; my comment was aimed more at the RNG aspect. I have not looked into this specific one either, but normally people would hopefully not publish their best randomly achieved run if the system cannot reproduce it or similar results.
That being said the paper in question doesn't seem to reference open source code anyway so I guess my point was kind of moot, apologies.
There are specific CUDA operations which are not guaranteed to be reproducible, though, as well as some cuDNN operations which are only deterministic at a performance sacrifice, and this does cause real problems.
You want to be able to set the seed, if only to be able to debug your program. Pseudo-random is sufficient for these models and is independent of any hardware settings. You should not share your random source between concurrent threads, though, but that's good practice anyway.
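On the last point, a small sketch of what I mean, using NumPy's SeedSequence to give each worker its own reproducible stream instead of a shared global RNG:

    from numpy.random import SeedSequence, default_rng

    # One parent seed, spawned into independent child streams (one per
    # worker/thread) so results don't depend on thread scheduling.
    ss = SeedSequence(42)
    child_seeds = ss.spawn(4)
    rngs = [default_rng(s) for s in child_seeds]

    print([rng.standard_normal() for rng in rngs])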
Most machine learning accelerators have a few non-deterministic operations. The chances that you could run trillions of floating point operations through a GPU and get a bit-for-bit identical result are low.
Some operations split and join data in non-deterministic ways (especially the order of operations, leading to different floating point rounding). If you shard across multiple machines, weight accumulation order will depend on network latency, for example.
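A toy illustration of the rounding point: floating point addition isn't associative, so anything that accumulates values in a latency-dependent order can legitimately produce different bits.

    import random

    # Same numbers, two summation orders: the results usually differ in the
    # low bits because float addition is not associative.
    vals = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(100_000)]

    a = sum(vals)
    random.shuffle(vals)
    b = sum(vals)

    print(a == b)      # usually False
    print(abs(a - b))  # tiny relative to a, but not zero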
Also, GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.
> ...split and join data in non-deterministic ways ... to different floating point rounding
Ah, of course! A very timely reminder, thanks!
> GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.
Now that's worrying. A bit flip can't be expected to be skewed towards any particular bit within a float, so it could easily happen in the exponent, skewing a single value by orders of magnitude one way or the other. Combine that with the rest of your 'good' results and yuck. That's very concerning. Thanks for the warning.
> A bit flip can't be expected to be skewed towards any particular bit within a float,
Actually, I think they are: for example, the exponent path through an adder/multiplier is typically shorter, so when operated close to clock speed limits, the exponent is more likely to be correct.
(I've not actually verified the above on real hardware)
I find this paper to be so steeped in hype and dogma as to be nearly incomprehensible.
Which is a shame, because it's a reasonable approach. I just wish they'd frickin' described what they did instead of spending the whole paper monologuing and showcasing unconvincing experiments. No need to justify what you're doing, just do it.
Fergus Lab at NYU. I believe he's across the hall from Yann LeCun as well ;)
Still a long way from a Theory of Biogenesis. But a good next step is using a differentiable model to predict novel proteins which have no analogue in Nature. Much like Materials Genome researchers searching for stable phases of matter!
"Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet
> "Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet
The Transformer layer is a radical leap over LSTMs and CNNs. While LSTMs can model sequences and CNNs regular grids, neither has an efficient long-range interaction mechanism. The Transformer does. It's a huge leap, similar to the one in computer vision from a few years ago.
What is needed, besides spatial translation invariance (CNN) and temporal invariance (LSTM), is permutation invariance. Whenever the problem can be described as a graph, the ordering of the vertices and edges should not matter. You can't do that with CNNs and LSTMs, but you can do it with graph neural nets and Transformers.
Apparently Transformers are the best for language modelling (GPT-2), playing games (Dota 2 from OpenAI), composing music, and possibly now modelling proteins. I assume they will play a huge role in working with graph-structured data, with multiple entities and relations.
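To make the "every position interacts with every other position" point concrete, here's a bare-bones, single-head scaled dot-product self-attention sketch in NumPy (no projections, masking, or positional encodings; without positional encodings the output simply permutes with the input):

    import numpy as np

    def self_attention(x):
        """x: (seq_len, d) array; returns (seq_len, d) of attended outputs."""
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)          # pairwise scores, any distance apart
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ x                     # each position mixes in every other

    out = self_attention(np.random.randn(10, 16))
    print(out.shape)  # (10, 16)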
Transformers work well on sequence tasks because they not only compare well in terms of accuracy but also scale better than RNNs like LSTMs or GRUs. That means they can be trained on more data.
This isn't really the same as CNNs, which model images by operating at different scales. I'm not aware of any cases of Transformers being used particularly successfully on images.
They can be used on graphs, of course, by translating the problem into a graph-walk problem (à la DeepWalk); see the sketch below.
All the examples you gave (language modelling, Dota 2, music and protein modelling) are set up as sequence prediction problems, so they are perfect for Transformers.
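On the DeepWalk point, the rough idea is to sample random walks over the graph and feed the resulting node sequences to a sequence model. A toy sketch:

    import random

    # Toy adjacency list; real use would be a large graph.
    graph = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b", "d"],
        "d": ["c"],
    }

    def random_walk(graph, start, length):
        walk = [start]
        for _ in range(length - 1):
            walk.append(random.choice(graph[walk[-1]]))
        return walk

    # Each walk is a "sentence" of node IDs that a sequence model can consume.
    walks = [random_walk(graph, node, 8) for node in graph for _ in range(10)]
    print(walks[0])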
But I'd note that it is built on top of a CNN base (ResNet or RetinaNet) and that the Attention-only system performed slightly worse than the one including the CNN layers.
Also, this isn't really a Transformer architecture, even though it uses Attention.
But maybe this is too much nitpicking? I agree that Attention is a useful primitive - my point is that the Transformer architecture is too specific.
(Also, this is a really nice paper in that it lays out the hyperparameters and training schedules they used. And that Appendix is amazing!)
Yann LeCun did not; otherwise he'd be a coauthor. As it is, this was a collaboration between NYU and Facebook AI Research, with multiple authors working at both institutions.
My understanding is that academic authorship credit is political: authors don’t always contribute, and contributors don’t always get credit. Is this not the case?
> It does a surprisingly good job of predicting protein function across a diverse set of tasks, including ones structural in nature, like the induction of a single neuron that is able, with some degree of accuracy (ρ = 0.33) to distinguish between α helices and β strands (I suspect the network as a whole is far more performant at this task than the single neuron we’ve identified, but we didn’t push this aspect of the analysis as the problem is well tackled using specialized approaches.)
I hate to be that guy, but distinguishing between alpha helices and beta strands is not really that hard.
It's a good start though. I would propose the following test: let's see if we can use the activations from the neurons to predict the luminosity of a 'base' GFP molecule (under a fixed set of experimental conditions). Train it on 10,000 mutations (this could maybe be done in very high throughput by tethering the XNA to a bead, synthesizing, and then measuring the beads one by one), and see if it can extrapolate the effects of 10k more, or heck, just do it brute force, we've got high-throughput robots, right?
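Roughly what I have in mind, as a sketch. Everything here is hypothetical: embed(seq) stands in for whatever per-sequence representation the paper's network produces, and the brightness labels are placeholders for the bead measurements.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    def embed(seq):
        # Placeholder featurization; in the real test this would be the
        # network's activations for the mutant sequence.
        rng = np.random.default_rng(abs(hash(seq)) % (2 ** 32))
        return rng.standard_normal(64)

    # Placeholder data: 10k mutants to train on, 10k to extrapolate to.
    sequences = [f"gfp_mutant_{i}" for i in range(20_000)]
    brightness = np.random.rand(len(sequences))   # stand-in for measured luminosity

    X = np.stack([embed(s) for s in sequences])
    X_tr, X_te, y_tr, y_te = train_test_split(X, brightness, test_size=0.5, random_state=0)

    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print("held-out R^2:", r2_score(y_te, model.predict(X_te)))

With random placeholder labels the R^2 will sit near zero, of course; the interesting question is what it does with real measurements.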
And predicting protein function is not that hard either. The ground-truth labels are often determined by sequence alignment similarity, not by experiment. So the results are far from profound.
Doing it right is quite hard.
Doing it usefully is even harder [1].
Getting a good training set without too many biases is the really hard part.
Generating a ground truth that is actually a truth is very expensive.
I have to read the paper carefully again. But for the contact prediction I think the training set will cover most of the data used in the validation, due to the way PDB "sequences" are distributed over UniParc, as well as how PDB 3D structures are generated experimentally. i.e. there are 120,000 PDB-related sequences in UniParc, but they cover only 45,000 entries in UniProtKB, because PDB-derived sequences are rarely full length, often mutated, and highly duplicative in coverage.
[1] Predicting the root GO terms will give you an insane TP/FP rate but is completely useless.
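This kind of leakage is at least cheap to check for crudely. A toy sketch using k-mer Jaccard similarity as a stand-in for proper identity clustering (CD-HIT or MMseqs2 would be the real tools; the sequences below are made up):

    def kmers(seq, k=5):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Made-up toy sequences; the validation one is a single mutation away
    # from a training sequence.
    train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
    valid_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"]

    train_kmers = [kmers(s) for s in train_seqs]
    for v in valid_seqs:
        vk = kmers(v)
        best = max(jaccard(vk, tk) for tk in train_kmers)
        if best > 0.5:
            print(f"possible train/validation overlap (k-mer similarity {best:.2f})")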
This is cool, but would be significantly cooler if they did some kind of biological follow-up. Perhaps getting their model to output an "ideal" sequence for a desired enzymatic function and then swapping that domain into an existing protein lacking the new function.
Bingo. That would be really interesting. And useful.
There are probably already enzymes in this data set that have measurements of their behavior. Could this modelling approach be coaxed to find the one with the highest processivity? Or do we need more labeled data?
I'm sure they have a bunch of enzymes in their dataset for which kinetic measurements have been published. Another interesting follow-up study would be attempting to improve kinetic behavior. They could, for instance, analyze some of the catalytically perfect enzymes out there (TIM, SOD, catalase, etc.) and see if the model could project improvements onto existing orthogonal protein classes.
Not in a structured way that is easily usable. Swiss-Prot has most of this data, but it is not quite normalized in units. If you did this annoying work I would like to talk to you so we can plug it into Swiss-Prot.
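For anyone tempted: most of the annoying work is exactly this sort of thing, sketched here with made-up values (the unit table and entry names are illustrative, not Swiss-Prot's actual format):

    # Convert kinetic constants reported in assorted units to a single unit.
    TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

    def km_in_molar(value, unit):
        """Km reported as (value, unit) -> molar."""
        return value * TO_MOLAR[unit]

    # Made-up entries, the kind of heterogeneity you'd scrape from papers.
    raw = [("enzymeA", 2.5, "mM"), ("enzymeB", 310, "uM"), ("enzymeC", 0.0007, "M")]
    normalized = [(name, km_in_molar(v, u)) for name, v, u in raw]
    print(normalized)  # [('enzymeA', 0.0025), ('enzymeB', 0.00031...), ('enzymeC', 0.0007)]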
The Attention Is All You Need paper is where Transformers were introduced:
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.