Deep learning can debug biology

posterboy · on Oct 15, 2016

I guess this is deliberately written to be cryptic, as it's an advertisement. Terms are used before they are introduced, but maybe I'm not in the target group. What's called prediction would technically be inference. Hand-wavy explanations of machine learning used to differentiate chemical analysis data seem to be the sell here.

saurabh20n · on Oct 15, 2016

Agree that this is not written with the rigor of a journal paper. Our intention was to communicate the simple wins we've had employing deep learning in comparison to the tools of the prior generation. XCMS is the most widely used library: http://www.bioconductor.org/packages/release/bioc/html/xcms..... It requires very painful parameter tuning. Internally, we had also built our own custom targeted analysis. In the targeted pipeline, we had to pre-specify "acetaminophen, shikimate, chorismate...". After building this deep learning workflow, we have exclusively switched over to it: no chemicals pre-specified, no parameter tuning. With about 185 engineered yeast that need analysis, with replicates, feeding conditions, and controls, these simplifications have been helpful.

We are getting easy wins over microbial data. Human data is noisier and we're testing over that now. More later.

If you have microbial data, or have used XCMS in the past and would like to compare, happy to chat. Email me at saurabh@20n.

sndean · on Oct 16, 2016

I should read some of the related papers first, but would some of these techniques potentially be useful for LC-MS/MS proteomic data?

I'll probably contact you with more specific questions.

karmel · on Oct 16, 2016

It might be the lack of detail in the piece, but it's unclear to me why this isn't a hammer to kill a fly. That is, why wouldn't a much simpler peak-finding algorithm be appropriate here? What is the NN doing that's more than just peak finding over many molecules? Is there some interdependency that I am missing, or is this just signal processing over millions of independent traces?

dre85 · on Oct 16, 2016

As far as I understand, NN is used here to find patterns which discriminate between sample cohorts (healthy vs disease). Peak finding gives you a list of peaks, but it doesn't tell you which of them discriminate between cohorts.
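To make that distinction concrete, here is a minimal sketch in plain Python. It is not the article's method: the local-maximum peak finder and the mean-intensity ratio test are simple stand-ins I am assuming for illustration, and the traces, `threshold`, and `min_ratio` are made-up values.

```python
def find_peaks(trace, threshold):
    """Return indices of local maxima whose intensity exceeds threshold."""
    return [
        i for i in range(1, len(trace) - 1)
        if trace[i] > threshold
        and trace[i] > trace[i - 1]
        and trace[i] >= trace[i + 1]
    ]


def discriminating_peaks(cohort_a, cohort_b, threshold, min_ratio=2.0):
    """Return peak positions whose mean intensity differs between cohorts.

    A position is kept when the larger cohort mean is at least
    min_ratio times the smaller one (a crude fold-change test,
    assumed here purely for illustration).
    """
    # Every position that is a peak in at least one trace.
    candidates = sorted(
        {i for trace in cohort_a + cohort_b for i in find_peaks(trace, threshold)}
    )
    selected = []
    for i in candidates:
        mean_a = sum(trace[i] for trace in cohort_a) / len(cohort_a)
        mean_b = sum(trace[i] for trace in cohort_b) / len(cohort_b)
        low, high = sorted((mean_a, mean_b))
        if high >= min_ratio * max(low, 1e-9):
            selected.append(i)
    return selected


# Toy traces: position 1 is a peak in every sample, position 4 only in "healthy".
healthy = [[0, 5, 0, 0, 9, 0, 1], [0, 6, 0, 0, 8, 0, 0]]
disease = [[0, 5, 0, 0, 1, 0, 0], [0, 6, 0, 0, 2, 0, 1]]

all_peaks = find_peaks(healthy[0], threshold=3)
different = discriminating_peaks(healthy, disease, threshold=3)
```

Peak finding alone reports both positions in the first healthy trace; only the cohort comparison singles out the one that differs. A learned model would replace the hand-written ratio test.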

dre85 · on Oct 16, 2016

I find this very interesting. As a related topic, would it be possible to use deep learning to classify samples based on the quantities of pre-identified chemicals? If so, how would this work roughly? Does anybody have any ideas? Traditionally people use linear discriminant analysis, PCA, PLS, etc. I can't really wrap my head around the use of multiple neural network layers for such problems.

saurabh20n · on Oct 16, 2016

One possibility is as an extension of the untargeted analysis: run the analysis over different kinds of samples. The output for each sample is the list of major peaks (and intensities). Use this as the "image" to train a (shallow) network.

You might even get away without specifying pre-identified chemicals. Adding that list would only help.
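As a rough sketch of the "image" idea, assuming each sample's output is a list of (m/z, intensity) pairs: bin the peaks into a fixed-length vector so every sample has the same shape before it goes to a classifier. The function name, bin range, and peak values below are hypothetical, not taken from our pipeline.

```python
def peaks_to_vector(peaks, mz_min, mz_max, n_bins):
    """Sum peak intensities into n_bins equal-width m/z bins.

    peaks is a list of (mz, intensity) pairs; peaks outside
    [mz_min, mz_max) are ignored.
    """
    vector = [0.0] * n_bins
    width = (mz_max - mz_min) / n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp to guard against floating-point rounding at the upper edge.
            index = min(int((mz - mz_min) / width), n_bins - 1)
            vector[index] += intensity
    return vector


# Made-up peak list for one sample.
sample_peaks = [(152.07, 10.0), (174.05, 4.0), (152.9, 1.0), (250.0, 7.0)]
features = peaks_to_vector(sample_peaks, mz_min=100.0, mz_max=200.0, n_bins=10)
```

One such vector per sample, stacked with a cohort label, is the kind of input a shallow network (or LDA, PLS, etc.) could be trained on.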

100ideas · on Oct 16, 2016

Who's working on ChemStructure2Vec? Could the Word2Vec approach be used to predict novel structures with functions in-between desired sets of known chemicals?

flipperkid · on Oct 16, 2016

From what I've seen, most use molecular fingerprints or cheminformatic descriptors such as RDKit provides. Google and Vijay Pande's group at Stanford had a recent publication on Molecular Graph Convolutions. I believe a lot of interesting research will come out in the area of molecular features over the next few years.

http://research.google.com/pubs/pub45548.html

cing · on Oct 16, 2016

People are working on such things! https://arxiv.org/abs/1610.02415

100ideas · on Oct 26, 2016

Wow! Thanks for the link.

Can we map the latent chemical space directly into a word space (can we ensure grammatical correctness? semantic correctness?)

> "We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation... We train deep neural networks on hundreds of thousands of existing chemical structures to construct two coupled functions: an encoder and a decoder. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to the discrete representation from this latent space..."

Consider the set of all chemical structures composed of between 0-6 carbon atoms, 0-4 oxygen atoms, 0-12 hydrogen atoms, 0-4 nitrogen atoms, and 0-1 "functional groups" expressed in the SMILES chemical nomenclature; for instance Alanine could be represented as O=C(O)C(N)C; Serine as C([C@@H](C(=O)O)N)O

Can we create a correspondence function that takes a given SMILES chemical structure, such as O=C(O)C(N)C, and returns a sequence of English words that encodes the same structure, such that, read left to right:

  - each atom is represented by a noun or adjective that starts with the same letter(s) as the atomic symbol; (pronouns don't count)
  - double or triple bonds are represented with prepositions 
  - charge is represented numerically
  - branches are represented by subordinating conjunctions
  - verbs and articles and other parts of speech (non-subordinating conjunctions, pronouns) can be used freely for grammatical correctness

It's hard to select word strings that are semantically meaningful sentences, but haiku-like forms are easier.

So Alanine ( O=C(O)C(N)C ) might be:

  (1)  O-Adjective/Noun preposition preposition C-Adjective/Noun subordinating-conjunction O-Adjective/Noun; 
  (2)  C-Adjective/Noun, subordinating-conjunction N-Adjective/Noun; 
  (3)  C-Adjective/Noun.

  (1)  Oranges blossom under and above Cherries where the Orangutans roam; 
  (2)  Conscious, unless Nocturnal; 
  (3)  Change coming.
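The first half of the mapping (SMILES to part-of-speech template) can be sketched in a few lines of Python. This is a toy under my own assumptions: a double bond becomes two prepositions (read off from the Alanine example), a triple bond becomes three, a closing parenthesis ends the current line, and only bare element symbols are supported, so charges, stereo marks, and bracket atoms are rejected. The name `smiles_to_template` is made up for this sketch.

```python
import re

# Assumed token set: a few bare element symbols, "=" and "#" bonds, branches.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFIH]|[=#()]")


def smiles_to_template(smiles):
    """Map a simple SMILES string to haiku-like part-of-speech lines."""
    leftover = TOKEN.sub("", smiles)
    if leftover:
        raise ValueError(f"unsupported SMILES characters: {leftover!r}")

    lines, current = [], []
    for token in TOKEN.findall(smiles):
        if token == "=":
            current += ["preposition"] * 2  # double bond (assumed rule)
        elif token == "#":
            current += ["preposition"] * 3  # triple bond (assumed rule)
        elif token == "(":
            current.append("subordinating-conjunction")  # branch opens
        elif token == ")":
            if current:  # branch closes: end the current line
                lines.append(" ".join(current))
                current = []
        else:
            current.append(f"{token}-Adjective/Noun")  # atom slot
    if current:
        lines.append(" ".join(current))
    return lines


alanine_template = smiles_to_template("O=C(O)C(N)C")
for number, line in enumerate(alanine_template, start=1):
    print(f"({number})  {line}")
```

Picking actual words for each slot, let alone meaningful ones, is the hard part and is not attempted here.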

Clearly many interesting structures could be expressed with semantically-invalid sentences.

But conversely, how often are "interesting" sentences chemically pointless or invalid?

Could we bias sentence construction to be more interesting by constraining it with the semantic vector space of a big work of literature, such as phrases found in Infinite Jest or the collected works of Shakespeare?