The top comment there (from a structural biologist) is worth reading. Here's my opinion, as a computer scientist who worked in this area.
A protein sequence is analogous to a computer program, but the "machine" is a mostly-water solution, and the instructions are interpreted by summing up all the intermolecular forces at play as the sequence is squirted out of a little extruder as a string (as in the stuff your clothes are made of, not text). Bits of the string repel and attract each other and it globs up in some biologically useful structure. The folding problem is the problem of predicting that structure from the string.
Unlike the halting problem, there is no way to generate an execution trace saying a particular glob would be formed in reality. In fact, there is no way to perform a polynomial time check of the result, so we've already escaped the land of P and NP.
Also, things like temperature and proximity to other proteins mean that there might not be a unique fold for a given sequence. Therefore, like the halting problem, we have unknown inputs, and we need to figure out which states an arbitrary program can reach.
When someone claims to have "solved" folding, you should be as skeptical as you would be if someone claimed to have solved the halting problem for arbitrary machine code, and that they don't need any extra information about the machines that run that code. Although their program runs on conventional computers, it also works on programs written for quantum computers.
(Edit: That's not to say this work isn't useful, or that this press release overclaims. I've been hearing some pretty wild claims about this work elsewhere...)
For an unrelated reason, shortly after making that comment, I put 31 genes from a viral genome (the whole genome, assuming we have the reading frames correct and nothing else funky is going on) through AlphaFold. We're getting ready to do some proteomics to see what's in the capsid, and I wanted to inform the proteomics by doing some sequence analysis. Only three genes of the 31 came back with any sort of confidence. Two of the three were crystallized and solved by my group a few years back.
FWIU, Folding@home has additional problems for AlphaFold, if not the AlphaFold team:
> Install our software to become a citizen scientist and contribute your compute power to help fight global health threats like COVID19, Alzheimer’s Disease, and cancer. Our software is completely free, easy to install, and safe to use. Available for: Linux, Windows, Mac
> which started at Washington University in St. Louis, home of the Human Genome Project
While Wash U was a contributor, I am confused about why you call it the home of the Human Genome Project. The Project seems a lot more strongly linked to the Whitehead/MIT in terms of press and the site of key figures.
Together, these teams have achieved a very significant cost reduction: the link I shared cites a sub-$1K cost to sequence a genome today; a cost savings of millions of dollars per genome.
> Folding@home answers a related but different question. While AlphaFold returns the picture of a folded protein in its most energetically stable conformation, Folding@home returns a video of the protein undergoing folding, traversing its energy landscape.
Is there any NN architectural reason that AlphaFold could not learn and predict the Folding@home protein folding interactions as well? Is there yet an open implementation?
I think it would be much harder to do that, since it probably requires modelling physics at some level, while AlphaFold is really just mining statistical correlations of structures and sequences.
Yes, there are open implementations of nearly-AlphaFold at this point.
FWIU there's no algorithmic reason that AlphaZero-style self play w/ rules could not learn the quantum chemistry / physics. Given the infinite monkey theorem, can an e.g. bayesian NN learn quantum gravity enough to predictively model multibody planetary orbits given an additional solar mass in transit through the solar system? (What about with "try Bernoulli's on GR and call it superfluid quantum gravity" or "the bond yield-curve inversion is a known-good predictor, with lag" as Goal-programming nudges to distributedly-partitioned symbolic EA/GA with a cost/error/survival/fitness function?)
E.g. re-derivations of Lean Mathlib would be the strings to evolve.
Pretty much everything you said doesn't make any sense. Folding@Home started at Stanford, not WashU. WashU was also not "the home of the human genome project", that was a distributed effort. AlphaFold doesn't contribute to Folding@Home, it's an entirely different problem.
Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different from that of the researcher who commented on the linked post.
I didn't see any claim by DeepMind that protein structure prediction is a solved problem. I think these guys are pretty diligent when it comes to communicating their science. What you may have seen is a non-scientist reporter making inaccurate claims.
The problem with structure prediction is not a loss/energy function problem. Even if we had an accurate model of all the forces involved, we'd still not have an accurate protein structure prediction algorithm.
Protein folding is a chaotic process (similar to the 3 body problem). There's an enormous number of interactions involved - between different amino acids, solvent and more. Numerical computation can't solve chaotic systems because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.
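To make the sensitivity point concrete, here is a minimal sketch using the logistic map as a toy chaotic system. It is a stand-in chosen for brevity, not a protein model, and the function names, starting value, and perturbation size are illustrative assumptions. It shows two trajectories that start a tiny distance apart and drift until the gap is as large as the values themselves; a rounding error behaves like such a perturbation.

```python
from typing import List


def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> List[float]:
    """Iterate x -> r * x * (1 - x), a textbook chaotic map when r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0: float, perturbation: float, steps: int) -> List[float]:
    """Absolute gap, step by step, between two nearly identical starts."""
    a = logistic_trajectory(x0, steps)
    b = logistic_trajectory(x0 + perturbation, steps)
    return [abs(p - q) for p, q in zip(a, b)]


if __name__ == "__main__":
    # Hypothetical inputs: a start of 0.2 and a perturbation of 1e-12.
    gaps = divergence(0.2, 1e-12, 100)
    print("initial gap:", gaps[0])
    print("largest gap over 100 steps:", max(gaps))
```

The gap roughly doubles each step on average, so a longer simulation does not average the error away; it amplifies it.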
Besides, short-range electrostatic and van der Waals interactions are pretty well understood, and before AlphaFold many algorithms (like Rosetta) were pretty successful in a lot of protein modeling tasks.
Therefore, we need a *practical* way to look at protein structure determination that is akin to AlphaFold2.
Now I really want to read a long-form book like this comment: ‘A Computer Scientist’s Guide to an Intuitive Understanding of Biochemistry’
I’ve found it extremely hard to have a casual understanding of biology, unlike math where I feel like I have a solid high level sampling of the field. I’ve done a few bio and chemistry courses and books but it’s so deep and ill suited for a programmer who is used to asking how things work underneath at every level (you have to constantly stop yourself from asking why something does what it does and just go with it until it starts to connect later, which is more of a commitment than I could give).
I would suggest carefully reading a deep textbook on biology like Molecular Biology of the Cell. You can't get a casual but realistic understanding of biology without a significant effort. That's a big problem in modern society. Biology is subtle and yet ever-important to us earth-bound organisms. The vast majority of people have only the most trivial understanding of biology, but scientifically we have a rather complete perspective and mental model that, due to its recent development, hasn't yet become common.
Slightly OT, but I am a computational chemist (PhD) looking to learn more about molecular biology (to, say, an undergraduate or beginning graduate level). I want to see ways in which advances in computational chemistry tools could be applicable outside of our usual domains.
I am looking at Molecular Biology of the Cell (Alberts) and Cell Biology (Pollard). Both were recommended to me, but wondering what the pros and cons of each are (if you are familiar with both of them).
I'm not familiar with Cell Biology by Pollard, but MBoC has incredible diagrams and flow charts that make pathways and other concepts incredibly easy to understand.
I would suggest taking MIT's Secret of Life course on edX. It's taught by Eric Lander, who was a key figure in the Human Genome Project and was a mathematician beforehand, so he follows an axiomatic approach that is much different from the way other schools teach biology.
Alternatively, Harvard Extension School has some great biology courses you can sign up and get credit for, though those are mostly for pre-med career changers.
- There is a (short) book called "A Computer Scientist's Guide to Cell Biology" by William Cohen which is a little pricey but very dense and helpful with a lot of concepts.
- Combine that with David Goodsell's "The Machinery of Life" which has a lot of great illustrations and practical examples.
Only way to truly learn biology imo is to read and do experiments. The feedback loop between those two things is what actually gives someone real intuition.
Is it possible or likely that the folding process is more procedurally deterministic than it seems? (given sequence, temperature etc) The degrees of freedom perhaps seem intractable because we don't know what steps the structure takes between the linear extrusion and final fold. AlphaFold, if I understand correctly, doesn't attempt to solve this problem. Your comment implies we should be skeptical of it because it's solving a potentially-intractable problem; perhaps it's both tractable, and AlphaFold doesn't solve it.
Let's say you have a car (or lego set etc). The number of possible ways the parts could go together are astronomical! Does that mean it's not possible to figure out how it fits together, or how you might build one?
Yes, if you have a Lego set, or a series of car parts, there are many ways to put them together to make something. What AF is doing as far as I understand is essentially looking at a catalog of all Lego sets ever produced, or all car models ever produced, and choosing one that most closely matches the pieces it is seeing.
But there is no reason to expect this process to produce the right end-result for a Lego set that has never been seen before.
Yes, but that competition is using lots of proteins that are similar to other known proteins, as far as I understand.
There is also a lot of sub-structure that helps - similar parts of proteins tend to fold in similar ways, so even if you don't have real predictive power on unknown sequences, you may do quite well for a protein that is 90% the same as one in the training set - you will be quite correct on ~90% of the folds, even if you're pretty far off on the remaining 10%.
Note that all of this is not to minimize the success of what AlphaFold achieved. I am just trying to explain how you can do well at this problem without having discovered some deeper deterministic structure in protein folds.
Yes, but many proteins can be boiled down to basically two classes - the folded portion, and the unfolded portion. The folded portions are typically shared (shared is a loose term, there's a lot of leeway) among almost all proteins.
So, I can pull a protein out of thin air and there's a good chance it'll have an overall fold similar to another protein that's got a structure. Unfortunately, the devil is almost always in the details. An amino acid here or there, a short extension here or there, a missing charged residue or an extra glycine and now you have a different target and entirely different behavior in a biological system.
One cool thing I found, actually, was that a protein in an archaeal virus had no known homology a few years ago, but when I checked the other day, it now matches most closely to an (otherwise thought to be) entirely synthetic protein out of David Baker's lab at UW. Which means this archaeal virus and David Baker converged on the same fold somehow (likely because it was "stable").
>When someone claims to have "solved" folding, you should be as skeptical as you would be if someone claimed to have solved the halting problem for arbitrary machine code
That's absurd. The halting problem is provably unsolvable with either conventional computers or quantum computers.
This is clearly not true for protein folding, although it is possible that it is computationally intractable with a conventional computer.
I think the parent comment is saying that it's impossible to arrive at a specific folding endpoint because that state is dependent on continuously changing environmental variables.
Take a look at the configs for Amber (molecular dynamics simulation -- https://ambermd.org). QC might help map the space of inputs that would converge, but it probably couldn't identify a hypothetical 'done folding' state for any given protein.
I don't think it's super valuable to spend time thinking about the computational class protein folding (or structure prediction) is in. It's clear now that approaches that approximate the expensive physics and extended sampling using every bit of additional information available are going to be much more successful in providing data that people need from structures.
I propose this as a thought experiment: Nature has solved this. How? Some lines of reasoning:
#1: The quantum interactions of electrons that are the basis for chemical bonds behave in ways our computers and intuition are incapable of simulating
#2 It's a matter of degree, not kind, and nature is more sophisticated than our computers, reasoning and thought processes.
#3 Nature is magic, whatever you define that to be
#4 When stipulating the degrees of freedom involved (ie from dihedral angles), the possibility of additional information we haven't discovered is being overlooked. Is there a recipe or algorithm that could help?
#5 Proteins don't fold in isolation. We know some proteins need chaperone proteins to fold, for instance. Others form part of a complex. The problem can't be solved in the general case just based on the sequence of the protein you want to know the structure of. That's also a problem experimentally -- we don't know if the structure of a crystallized protein is really the biologically meaningful form.
I'd go with #1. Especially considering that there are quantum approaches to protein folding.
But nature hasn't really "solved" the problem, it is just doing its thing, but the way it does things is completely different from what our computers do.
It is like trying to reproduce a guitar sound using a synthesizer. A guitar solves the problem of sounding like a guitar, but that doesn't mean it is more sophisticated than a synthesizer. In fact, a synthesizer can do much more; it is just that the process by which the guitar makes sound is hard to simulate.
Could be! Are you thinking thermodynamic fluctuations from surrounding water molecules jostling things around into many combinations? In this view, do you think the final protein would be found by chance, or through intermediate assemblies?
NP problems are ones whose positive solutions are verifiable in polynomial time.
For example, the problem "is there a route in this graph that visits all nodes and has length <= L" can be quickly verified with a classical computer, as long as you're given a "yes" answer accompanied by such a route. Finding the answer from scratch might be much slower, but checking it is quick.
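As a rough illustration of "checking is quick", here is a minimal sketch of a verifier for that route problem. The graph encoding (a dict keyed by unordered node pairs) and the name `verify_route` are assumptions made for this example, and revisiting a node is allowed as long as every node appears at least once. The check runs in time proportional to the length of the claimed route.

```python
from typing import Dict, FrozenSet, List, Set


def verify_route(
    nodes: Set[str],
    edge_lengths: Dict[FrozenSet[str], float],
    route: List[str],
    limit: float,
) -> bool:
    """Check a claimed 'yes' certificate: a route covering all nodes within limit."""
    # The route must mention every node of the graph and nothing else.
    if set(route) != nodes:
        return False
    total = 0.0
    # Each consecutive pair in the route must be joined by an actual edge.
    for a, b in zip(route, route[1:]):
        length = edge_lengths.get(frozenset((a, b)))
        if length is None:
            return False
        total += length
    return total <= limit


if __name__ == "__main__":
    # Hypothetical three-node graph used only for illustration.
    nodes = {"A", "B", "C"}
    edges = {
        frozenset(("A", "B")): 1.0,
        frozenset(("B", "C")): 2.0,
        frozenset(("A", "C")): 5.0,
    }
    print(verify_route(nodes, edges, ["A", "B", "C"], 3.0))
    print(verify_route(nodes, edges, ["A", "C", "B"], 3.0))
```

Note that the verifier only confirms a route someone hands it. It says nothing about how to find one, and a failed check on one route does not prove that no valid route exists.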