In a First, an Entire Organism Is Simulated by Software

alexholehouse · on July 20, 2012

This is very exciting progress, but let's remember that there is a rather important caveat with this sort of thing;

"For their computer simulation, the researchers had the advantage of extensive scientific literature on the bacterium. They were able to use data taken from more than 900 scientific papers to validate the accuracy of their software model."

This is not (necessarily) a model of M. genitalium - it's a model of our understanding of M. genitalium, and as such incorporates in everything that current technology in biological sciences allows us to look at. It takes a huge amount of data from many sources and tries to bring it together on a scale not previously done. However, that data may have significant flaws, biases and ultimately considers only what we're good at looking at (technologically/scientifically speaking). It's awesome, and 100% the right direction for the field, but equally it is not a, "synthetic life form being simulated" as much as, "a very very complicated model which uses huge amounts of multidimensional data to try and replicate the behavior seen in that data".

kens · on July 20, 2012

That's true of any simulation or model - it's only as good as the data that goes into it.

The exciting thing is that with this model, they can now rapidly iterate between the behavior of the model and the behavior of the real organism, see where the gaps in knowledge are, and work to fill them in.

(I'm happy to see M. Genitalium getting attention, since I made a Java genome display for it 14 years ago http://www.righto.com/java/genome/MG.html)

pbhjpbhj · on July 21, 2012

>That's true of any simulation or model -it's only as good as the data that goes into it. //

Models can be predictive, indeed that's one of the main reasons to create a model to observe behaviour that otherwise might not have been expected.

A theory in physics for example is a model with proven predictive powers, the model [theory] 'knows' some aspects of what can be observed before those things can be confirmed empirically.

But as you indicate new empirical data can shows flaws in a model just as it can in a scientific theory.

jkarr · on July 20, 2012

The model is "M. genitalium" in the sense that it attempts to predict M. genitalium's phenotype from its genotype. In order to build the model we needed to use data from many additional organisms. In that sense, the model represents a hybrid of all of these organisms.

alexholehouse · on July 25, 2012

Agreed, and, for the record (as someone who's worked on modelling at various scales) I didn't for a second want to detract from this achievement - it really is incredible, and, as per usual, the discussion here shows that the HN community is more than au fait with the implicit assumptions made.

Also, and this is something I personally feel very strongly about, the paper is incredibly well written and accessible. There is no point making science difficult to understand. Science is (itself) not difficult - it's complicated, encompassing, and confusing, but nothing is actually difficult. There is a trend to use big words and complex language in scientific papers, which I worry must be incredibly off-putting for those less used to them, and are almost always totally unnecessary. This paper was just beautiful (I devoured it on my subway ride home) and certainly one I'll be recommending to friends interested in the systems biology field.

jacobolus · on July 21, 2012

“The best ... model of a cat is another, or preferably the same, cat.” Rosenblueth and Wiener, 1945. http://www.csee.wvu.edu/~xinl/papers/role_model.pdf

pbhjpbhj · on July 21, 2012

Pithy but flawed.

The best representation of a thing is that same thing in the state you're trying to represent. The cat is not a model, a dead cat is not a great model of a live one.

apl · on July 20, 2012

With a data base consisting of hundreds of largely independent findings, it'd take one hell of a systematic bias to not get at least something right. Granted, they're obviously abstracting here, but we should avoid underplaying this achievement as much as overestimating it.

starpilot · on July 20, 2012

It sounds like the model is more postdictive than predictive. Not useless, but more like measuring a huge number of right triangles to determine the ratio of side lengths, versus deriving the pythagorean theorem.

Dn_Ab · on July 20, 2012

There is perspective and there is selling a work short. This is much more than a blind statistical model of the parameters. In fact, your criticism would apply just as well if not more to the Higgs search methodology.

They develop a number of submodules that each use an appropriate algorithm or branch of mathematics to best represent the interaction. To integrate them, results from the previous time-step feed each model as appropriate. The derivation of the correct mathematics and algorithm to use and the best way to connect them is anything but measuring a bunch of sticks.

Plus they were able to do some predictive modelling and pointed a direction for a novel result. From the abstract: "experimental analysis directed by model predictions identified previously undetected kinetic parameters and biological functions." This is not about simulating organisms, it is for accelerating experimental discovery of interactions, dynamics, parameters etc. This is a proof of concept that a technique which may one day contribute to saving lives is viable.

fghh45sdfhr3 · on July 20, 2012

Exactly. We can't even calculate the exact 3d shape of any large protein, much less exactly how it would interact with all other proteins inside a cell. This is an approximation. A great scientific success for sure, but not an exact simulation of all the parts of a single cell.

timr · on July 20, 2012

"It's awesome, and 100% the right direction for the field"

Awesome, maybe. But "100% the right direction for the field" is a stretch. This is press-release bait. What hypothesis did it address? What theory is it advancing? (other than "we theorize that we can simulate a cell in a computer...ohlook, we did it!", of course.)

This is interesting from a technology perspective, perhaps, but it's hard to call it science.

hobin · on July 20, 2012

"This is interesting from a technology perspective, perhaps, but it's hard to call it science."

As a physicist and scientist myself, I feel extremely inclined to say 'potato, potahto' at this one. This technology allows us to test many more hypotheses at a much quicker pace, which is good enough for me to call it science.

timr · on July 20, 2012

This technology allows us to test many more hypotheses at a much quicker pace"

I think you're seriously overestimating the quality of the model. This isn't particle physics -- it's not as if there's an equation that precisely predicts the results of any particular molecular interaction. We don't even know if we know the full set of possible interactions in a system of this size. How can we possibly simulate it?

This is an example of "we ran the fancy machine for a while, it spit out some data, and we cherry-picked 'interesting' results from the output. Some of them even show up in experiments!" These sorts of breathless announcements are common in computational biology, but they usually don't amount to much. The best work is extremely reductionist. As others have already noted, it's easy to spend 10x the computational resources simulating a single protein on timescales far, far shorter than the ones simulated here.

hobin · on July 20, 2012

"I think you're seriously overestimating the quality of the model. This isn't particle physics -- it's not as if there's an equation that precisely predicts the results of any particular molecular interaction. We don't even know if we know the full set of possible interactions in a system of this size. How can we possibly simulate it?"

(Edit: there was a whole bunch of stuff here, but it's pretty much irrelevant given that we seem to agree on the basic points.)

"This is an example of "we ran the fancy machine for a while, it spit out some data, and we cherry-picked 'interesting' results from the output. Some of them even show up in experiments!" These sorts of breathless announcements are common in computational biology, but they usually don't amount to much. The best work is extremely reductionist. As others have already noted, it's easy to spend 10x the computational resources simulating a single protein on timescales far, far shorter than the ones simulated here."

That's fair enough. I checked your profile and saw that you have a PhD in computational biology, which is not my field of expertise, so I'm inclined to take your word on that one. In that case, my comments do, of course, not apply.

pbhjpbhj · on July 21, 2012

But the hypothesis is being tested against the model and not against reality. The model can help to indicate the result but in my opinion it's not scientific advance until tested against reality.

Now there may be a question as to whether we can ever truly test a hypothesis against reality or if we're stuck with models ...

Scene_Cast2 · on July 20, 2012

It would start to get interesting if the simulation could predict results outside the scope of the input data. E.g. no-one tried putting the cell into environment X and using stimuli Y, see how well the simulation can predict the results.

This is the main point of interest for me personally.

Volpe · on July 21, 2012

That's true for anything that exists in the real world. Essentially you are suggesting that reproducing something empirically is flawed because empiricism is flawed. No?

gauravk92 · on July 21, 2012

Baby steps.

gliese1337 · on July 20, 2012

    Currently it takes about 9 to 10 hours of computer time to simulate a single division of the smallest cell — about the same time the cell takes to divide in its natural environment.

So, it's a realtime simulation! Cool, although wholly incidental.

It's hard to tell from the article exactly how fine-grained the simulation actually was. Was it actually at the level of tracking individual molecules, or did each "cell object" just keep track of, e.g., concentrations of different types of molecules?

Also, I wonder if this will have any long-term impact on things like the Open Worm project ( www.openworm.org ).

jkarr · on July 20, 2012

We used a combination of tracking concentrations and individual molecules. For specific macromolecules like DNA polymerase, RNA polymerase, ribosomes, etc we kept track of the position of each individual molecule. For other things like glucose, water, etc with very high copy number we just kept track of the copy number.

I agree modelling entire organisms will ultimately require large collaborations, potentially through projects like Open Worm.

karpathy · on July 20, 2012

Agreed, i spent a while looking for this. Looking at Figure 1B in the paper though suggests that it is the latter. There are no actual molecules bouncing around.

apl · on July 20, 2012

http://wholecell.stanford.edu/

Source code and training data available online; written, of course, in MATLAB. Very refreshing, and I'm looking forward to dissecting this first-hand.

joe_the_user · on July 20, 2012

Wow,

Having the code in Matlab seems like a disaster as far as ever making this or similar approaches modular and so usable-by-others goes. And I do know by miserable experience that Matlab indeed what biologists generally use but if biology is ever going to interface with larger scale software construction, it seems like it is going to have to change it's standard operations a bit.

Edit: And this isn't saying Matlab is generically "horrible". It is great at what it does but horrible from my perspective, as a programmer whose task usually is putting pieces of software together.

jkarr · on July 20, 2012

Agreed. MATLAB its set of downsides. On the plus side development is rapid and since 2008 it has good support for classes. We're starting to move toward python for future work.

I think MATLAB itself isn't so bad, so much as the way its typically used -- poorly commented, organized, and untested. We put a lot of effort into clearly commenting and organizing the code. We used matlab-xunit and hudson to manage testing, which worked quite well. http://www.mathworks.com/matlabcentral/fileexchange/22846-ma..., http://www.mathworks.com/matlabcentral/fileexchange/33971-xm....

apl · on July 20, 2012

  > We're starting to move toward python for future work.

Thanks a lot for commenting. As somebody who does a lot of scientific coding in Python and part-time contributor to a couple of projects, I'm of course curious -- what's your reasoning behind the move and where do you anticipate to see the most friction?

jkarr · on July 22, 2012

A variety of reasons really. In no particular order: - Still has good math support, rapid development - Better OOP support - More libraries -- everything from web development to scientific computing to GUIs to databasing - More reliable, less buggy - Better development tools -- code completion, testing, coverage, profiling, etc - Free. Therefore can more easily be run on clusters without licensing issues or using the MATLAB MCR

I don't see any friction, except if one needed to port old code. It seems that a lot of people are moving toward SciPy/NumPy these days.

apl · on July 20, 2012

  > the code in Matlab seems like a disaster as far as ever
  > making this or similar approaches modular and so usable-
  > by-others goes.

As always, it depends. I've seen very well maintained MATLAB code bases, and I've seen the opposite (with the latter greatly outweighing the former). We should give these guys the benefit of the doubt. Somewhere Karr mentions Hudson CI, so they don't seem fully removed from good practices. Interfacing with MATLAB from C is reasonable.

In an ideal world, this would be a NumPy/SciPy prestige project, but neither the community nor Python for that matter are quite there yet.

joe_the_user · on July 20, 2012

The thing is that Matlab is problematic for further extension even if the basic code is well done simply because of Matlab's weird function composition architecture and other weirdnesses.

Consider that if biological simulations are going to go to a larger scale, you won't want to simply call a bunch of Matlab simulations from a single C program but rather have a bunch of distinct programs that would be modified to call each other (with each of these programs running as they do now in their Matlab instance).

nagrom · on July 20, 2012

I'm confused. In what way is Python not there? If one specifies, e.g. Python 2.7, what do you consider to be the short-comings?

I'd also like to know how you think numpy/scipy are short compared to Matlab. I've not run into their limitations yet.

lazyjeff · on July 21, 2012

I think the main thing is libraries. I was looking for graphical model software that supported DBNs once, and could only find Matlab ones. There was a python wrapper for one of the Matlab libraries but was not widely supported.

cemerick · on July 21, 2012

I don't know much about biology or Matlab, but this seems to be a common refrain.

    *XXXXXists make huge breakthrough on YYYYY*

    "I can't believe they used programming language Z!"

Domain specialists generally have more market leverage than the random programmer geek that scoffs at their "primitive" tools, and the code they write is there to do a job, not satisfy the peculiar preferences of a specialist in some other domain (like programming).

Now, when programmers work to make domain specialists' job easier, that gets some attention; thus the tidal levels of interest in MDA, DSLs, and so on.

fruchtose · on July 22, 2012

Absolutely agreed. If you're doing large-scale simulation, you need explicit parallel computing. You need C/C++. MPI, ScaLAPACK, the whole shebang. 128 nodes is great, but I suspect the simulation will run orders of magnitude faster if written in C/C++.

jkarr · on July 20, 2012

All of the source code is available at http://simtk.org/home/wholecell. http://wholecell.stanford.edu provides a searchable interface to our training set database.

Cieplak · on July 20, 2012

I would love to read the article, but $31.50 is a bit steep.

http://www.sciencedirect.com/science/article/pii/S0092867412...

Thanks, Elsevier.

gwern · on July 20, 2012

http://dl.dropbox.com/u/85192141/2012-karr.pdf

Cieplak · on July 20, 2012

Cheers!

jkarr · on July 20, 2012

All of the supplementary material is available at http://simtk.org/home/wholecell. http://wholecell.stanford.edu provides a search interface to all of the training data.

joe_the_user · on July 20, 2012

My vague Googling gets that a human being has 50-100 trillion cells. So Moore's law would say a full human simulation might be 80 years away if it keeps up. I wonder if an organ could be simulated taking a smaller selection of cells, determining their characteristic interactions and extrapolating from there.

The further question with any project for full organism simulation would be how many lines of code are going to be produced and what would the process of maintaining that code look like?

praxulus · on July 20, 2012

On the one hand, a human cell is many times more complex than the cell simulated here (you could say that each of the hundreds of organelles in a human cell are closer in size and complexity to a bacterium than the whole cell). On the other hand, you could throw a datacenter at the problem, and get quite a leap over the 128-machine cluster used by the researchers.

You're probably right about the hierarchical simulation though. I have trouble believing that it's actually useful to model an entire body at a macro-molecular level. It would be like modeling electrons in circuit design software, rather than using the abstractions of voltage and current.

schiffern · on July 20, 2012

It gets tricky because the human body is an evolved system.

Human-designed systems need to be human comprehensible, so they're laid out in neat hierarchical layers, each abstracting away the complexity of the underlying process. Evolved systems aren't restricted by comprehensibility.

http://www.damninteresting.com/on-the-origin-of-circuits/

>Finally, after just over 4,000 generations, the test system settled upon the best program. When Dr. Thompson played the 1kHz tone, the microchip unfailingly reacted by decreasing its power output to zero volts. When he played the 10kHz tone, the output jumped up to five volts. … And no one had the foggiest notion how it worked.

>Dr. Thompson peered inside his perfect offspring to gain insight into its methods, but what he found inside was baffling. The plucky chip was utilizing only thirty-seven of its one hundred logic gates, and most of them were arranged in a curious collection of feedback loops. Five individual logic cells were functionally disconnected from the rest– with no pathways that would allow them to influence the output– yet when the researcher disabled any one of them the chip lost its ability to discriminate the tones. Furthermore, the final program did not work reliably when it was loaded onto other FPGAs of the same type.

joe_the_user · on July 20, 2012

It will be interesting to see what's possible when or if we have supercomputers an order of magnitude or two more powerful than the present ones. The problem of producing software of similarly larger size is naturally daunting.

I think voltage etc is airtight abstraction mostly because each electron is guaranteed to be both simple and the same. Cells are both complex and distinct from each other (based on both genetics and internal physiology and so-forth). So macro-configuration of cells would seem to be a more leaky abstraction.

Roger J. William's classic text Biochemical Individuality describes how much the parameters of even very basic physiological functions varies from person to person. So unlike a chip which starts with simpler building blocks and is designed to depend on discreet inputs as much as possible, the simulation of an organism may not have a better solution than a bottom up design with perhaps a variety of clever shortcuts.

kens · on July 20, 2012

I've been looking a lot at the 6502 processor simulation (http://visual6502.org) and it's interesting to consider the different levels of abstraction possible for chip simulation. The current simulator simulates abstract on/off transistors, which is sufficient for almost everything, but it doesn't exactly handle some unsupported opcodes that put conflicting signals on the bus. For that, you'd need to simulate actual voltage levels. If the transistors were very small, the voltage abstraction would break down because each electron starts to count. The simulator also ignores propagation delays. Moving up the hierarchy, people have gate-level 6502 simulations, which are tricky to implement exactly since the 6502 uses a lot of pass transistors and stored charge, rather than strictly Boolean logic. And then the typical CPU simulator runs at the register level, which is a lot simpler, but often gets the corner cases wrong (e.g. decimal arithmetic with invalid inputs).

The point is that circuits can be simulated at many different levels of detail, with low-level simulations more likely to get things exactly right, but with high-level simulations much faster, easier to write, and easier to understand.

Likewise, it will be interesting to see with cell simulations how much complexity can be abstracted away and still have a useful simulation. To get all the protein interactions right for example, you'd need to simulate individual atoms, which is insanely slow. So to simulate a cell, you're probably running at the level of protein and chemical concentrations and known interactions, which is faster but introduces error. For instance, how much do local concentrations matter? And to simulate a multicellular organism, you're probably going to make the cells fairly abstract.

Personally, I think the key area for biology is going to be dealing with cell state. Cells hold state in a lot of different ways over many time frames (eg epigenetics), and I think computer scientists have a lot to offer biologists in understanding state. Someone else mentioned the few hundred cell types in the human body, but the internal state makes a huge difference. (Not to mention distributed state, such as how the brain stores information.)

coopdog · on July 21, 2012

I wonder if it would be possible (in the not so distant future) to simulate an entire human, but 'cache' the results of parts of the simulation. Let the program decide where a certain interaction has been calculated too much and create its own level of abstraction.

Sort of like how image compression algorithms store an abstraction of the image for large, uniform areas and get more granular for complexity, but applied to simulation instead

ahalan · on July 20, 2012

Porting researchy Matlab code to optimized Fortran or C could potentially give you the same results as throwing more hardware at the problem, but cheaper and probably faster.

gwern · on July 20, 2012

But unless the optimizations change the complexity, you're still going to be very limited: the constant factors may let you go from 1 cell in Matlab to 100 cells in C on the same hardware, but that's still just 100 cells and now you're stuck.

technotony · on July 20, 2012

Human's have 23,000 genes, vs 525 in this organism. On the assumption that complexity increases with the square of the number of genes you would need 245,000 computers today to simulate one human cell. And it could be more than that because of RNA etc.

radicalcut · on July 20, 2012

Human cells may have tens of times more forms of proteins expressed from these 23,000 genes if you take into account alternative splicing of pre-mRNA and many other post-translational modifications. These are all processes which prokaryotic organisms like M. genitalium generally lack.

Even when we look aside from the level of DNA/RNA there are huge differences in morphological organisation of eukaryotic cells when compared to most prokaryotes: dynamic compartmentalisation of cytoplasm, different types of cytoskeleton, vesicle trafficking, complex signal transduction networks instead of usually simple two-component regulatory systems... So the simulation of whichever human cell type could be much more complicated than one could initially thought.

I don't want to sound too much pessimistic, as someone with background in both CS and molecular biology I'm truly excited about this, but I still had to cool myself down a little bit after reading the article. I can't wait to read the original paper.

jff · on July 20, 2012

I don't think you can just "throw a datacenter at the problem", this is why Blue Gene exists rather than just renting Google's datacenters. Modeling cells in the body would have some advantages in parallelism due to the very real locality of the problem, but with modern parallel programming techniques things do not scale that simply. A 512-node cluster would not necessarily be able to model something 4 times as complex; as the complexity increases, the computation/communication ratio will decrease, meaning your network (especially a datacenter network) will become a bottleneck.

thronemonkey · on July 20, 2012

The problem with that view is that a human cell is vastly more complex than a bacterial cell. Let's take transcription as an example. Transcription in bacterial systems is well understood at the chemical level, can be reconstituted in vitro, and simulated with a high degree of accuracy. Transcription in Eukaryotes (such as humans) is an absurdly complex process involving oodles of regulatory proteins that interact in unknown ways, chromatin remodeling and 3D interactions, and adapter factors that recruit RNA polymerase in as yet unknwon ways. The whole process is extremely poorly understood and probabally won't be reconstituted in vitro in our lifetime.

This is only one example of many cellular processes which are substantially harder to understand in eukayotes than bacteria. The lesson is that not all cells are created equal, different kinds of organisms have very different cells.

Also note that modeling interactions between cells makes things much more complicated in a very nonlinear way.

kibwen · on July 20, 2012

I wonder if an organ could be simulated taking a smaller selection of cells, determining their characteristic interactions and extrapolating from there.

Sort of like a Hashlife[1] for real life, as it were.

And let me just say that I sure hope we're not still coding in 80 years. Fun as it may be, we'd probably be better off letting machines handle that sort of thing. Get started on that AI!

[1] http://en.wikipedia.org/wiki/Hashlife

Dn_Ab · on July 20, 2012

We don't need full organism simulation for this to be useful. We just need tissue and cell level interactions that are accurate to have this be useful in drug search. As we want to look at effects on metabolism etc a coarse grain approach could provide direction.

The computational complexity aspect could be side-stepped if quantum computers were invented before 80 years. One of the few things quantum computers can do exponentially better is simulate quantum mechanics. Biology is chemistry which is "just" a bunch of quantum many body problems. With a QC we can model Protein interactions, RNA, signaling, transport etc much more efficiently than a regular computer.

pbw · on July 20, 2012

Yes I think you'd want to do a lot of extrapolation, at least for the next 80 years.

You could imagine a hierarchy of simulations which dynamically adjusted the level of detail. Do you simulate every cell division to see if cancer develops, or do you just do "perfect" division with a roll of the dice for a copy error? Depends on your needs.

You could imagine organs like your skin or liver have lots of redundant computations, where you can elide away details more easily. While say your brain probably depends on the idiosyncratic behavior of individual neurons bubbling up to a macroscopic level, requiring a more detailed simulation.

technotony · on July 20, 2012

According to wikipedia (http://bit.ly/SLQDiT) there are between 215 and 411 different cell types in the human body. I would expect each of those different cell types to behave similarly (maybe one order of magnitude difference to account for environmental effects around the specific cell). This means we should be able to do a human simulation much sooner than 80 years.

riffic · on July 20, 2012

why are you using URL shortening? http://en.wikipedia.org/wiki/List_of_distinct_cell_types_in_... is better.

joe_the_user · on July 20, 2012

It is an extraordinarily interesting question.

Having multiple sets of same cell would only speed up the simulation under some circumstances. In the most general case, it wouldn't. For example, if a cell itself is a Turing complete computer whose computations matter, then you definitely need to individually simulate each cell.

nagrom · on July 20, 2012

For a large-scale organism, how would you set the starting conditions? It seems unrealistic to not include some external influences from, e.g. air and diet. When you simulate a bigger organism, not only do you face a geometric increase in the number of cells but you also must account for externalities, I guess. I have no idea how you may do that on a mouse, let alone something as large as a human...

tocomment · on July 20, 2012

You're forgetting about interactions between the cells.

siavosh · on July 21, 2012

As an undergrad, I did some research work in a bio-computation lab trying to model the basal ganglia (a part of your brain that helps you move and learn). I left shaking my head at how much guess work, "fine-tuning", and hand waving there was in the field. The biologists didn't appreciate how crappy code can deceive and the engineers didn't appreciate the mind-boggling complexity and non-imperative world of biological systems.

I can't even imagine how much parameter tuning and hacks went into a model of such staggering complexity. Paraphrasing one of my old academic advisors, the curse of models is that you can always make them look good.

pbw · on July 20, 2012

http://www.stanford.edu/~jkarr/research.html look for "more information" links to animations, etc.

http://wholecell.stanford.edu/ contains a link to the code (written in matlab)

drcode · on July 20, 2012

I think this news is huge: I suspect they probably took a lot of shortcuts and aren't simulating everything at the level of individual atoms. However, this type of system could be refined and perhaps give a 99% accurate simulation of a cell.

Imagine if you could get the DNA from a cancer cell in a human patient, as well as the DNA of a normal cell, and then test the effect of a million different randomly-generated molecules until you find one that kills the cancer cell, but not the normal cell.

If you could scale the performance of this type of system and allow it to simulate Eukaryotic cells (much more difficult) it might let you cure most cancers!

Cieplak · on July 20, 2012

Any clue what software/technology they are using? I'm guessing Java.

tosseraccount · on July 20, 2012

It appears to be hodgepodge of technologies but mostly Matlab (.m) ...

Downloaded code zip file and unziped to to "WholeCell" directory

find WholeCell -type f | awk '{FS="."; print $NF}' | sort | uniq -c | sort -n

...

      2 java
      2 lib
      2 log
      2 mexa64
      2 mexw32
      2 pdf
      2 swf
      2 vbs
      2 xlsx
      3 exe
      3 fsa
      3 mexw64
      4 jpg
      4 svg
      5 desktop
      5 license
      5 sql
      5 TXT
      6 col
      6 sh
      6 tmpl
      6 tpl
      7 ico
      7 xml
      8 bat
      8 pl
      8 z
      9 json
     15 dot
     16 gif
     18 map
     19 css
     23 dll
     24 dat
     25 p
     52 mat
     62 txt
    114 jar
    238 png
    277 js
    427 php
    531 m
   1096 html

apl · on July 20, 2012

A lot of that stuff is incidental stuff; see .php, .html, .js, and so on. (Apparently there are some web-based tools that he mentions; not sure, though.) The bulk amounts to MATLAB plus Java libraries for remedying MATLAB's functionality gaps.

jkarr · on July 20, 2012

We used MATLAB. We organized the training data into a database (referred to as the "knowledge base") which was implemented in MYSQL with a web-based UI implemented in PHP. The various other scripts in shell and Perl were used to run the model on our Linux cluster.

kibwen · on July 20, 2012

I know exactly zilch about this field, so it very well could be something that mundane. However, for spreading calculations across a cluster of 128 computers, I sort of hope they're using something more specifically tailored for the task.

Keeping in mind that my knowledge is zilch, here are some exotic languages that appear to be designed for this sort of thing:

http://en.wikipedia.org/wiki/X10_%28programming_language%29

http://en.wikipedia.org/wiki/Chapel_%28programming_language%...

http://en.wikipedia.org/wiki/Fortress_%28programming_languag...

bugsbunnyak · on July 20, 2012

The source code is available.

https://simtk.org/project/xml/downloads.xml?group_id=714#pac...

I went to a talk by Dr. Covert a few months ago and it was fascinating, all the moreso as a chalk-talk with an animation flip-book handout.

As an aside, since the topic of CS folks helping with biology research comes up often: apparently a Google engineer went on sabbatical to the Covert lab and helped re-architect and optimize the system to the point that it was feasible for this research. There are many other projects that would benefit from this type of expertise.

apl · on July 20, 2012

Nope. Seems to be heavily MATLAB-based.

dglassan · on July 20, 2012

I'm wondering the same thing

Achshar · on July 20, 2012

Can any one explain what they mean by simulate? Like to what depth? Do they define different types of cells and their functions and then let them work? Or molecules, atoms? particles?? I strongly believe a "proper" simulation is impossible because we don't even know the bottom of the barrel, so best we can do is define a building block and how it behaves and see it grow from there. Although the coolest thing would be to define the quarks and four forces and then let it organically grow into some kind of matter over time. We can also control time in a simulation which will basically confirm our understanding if the outcome of the simulation is similar to real world.

JackC · on July 20, 2012

Just based on the Times article, I think the idea is, define a simulation of a cell and subject it to a bunch (100s) of simulated experiments, based on real experiments that have been run on real versions of the cell. If you get the same results from your simulated cell that other researchers got from watching real cells, then your simulation is potentially accurate enough to run new experiments on that will generate useful information about the actual organism. The fact that it predicts the outcome of previous experiments we were interested in suggests (hopefully) that it has sufficient resolution to predict new stuff we're interested in.

So in this case, it sounds like they're simulating the interaction of genes and molecules, since they think that's sufficient to model cell behavior (and/or it's the best we can do). But it doesn't really matter what technical level of detail they went to -- the only useful definition of a "proper" simulation is whether it behaves the same as the real thing in the context you care about. For example, this simulation would be totally insufficient if I wanted to model a hydrogen bomb -- but totally excessive if I wanted to model gravitational forces on independent objects in space. If it's good enough to tell us anything new about the actual cell, that'll be pretty cool.

jkarr · on July 20, 2012

We used a combination of representing copy numbers of molecular species and representing individual molecules of a few specific types: DNA polymerase, RNA polymerase, ribosome, FtsZ

SoftwareMaven · on July 20, 2012

I would love to see this software on Github and to watch the organism "evolve" in different forks into different virtual organisms. A very meta-biological-software thing.

nagrom · on July 20, 2012

Well, the comment above yours says that the whole thing is available: http://news.ycombinator.com/item?id=4272721 There's nothing to stop you from creating a Github project with all that information yourself.

In my experience, physics researchers are not familiar with GitHub or similar OSS hosting sites. The omission of a Github project is likely through ignorance or apathy, rather than disapproval. I'd be happy (delighted!) for you to take my own physics simulations and play with them, for example.

noduerme · on July 20, 2012

I'm wondering what language this code was written in and if there couldn't be some significant speed gains by threading it differently. Is this modeled down to the molecular level, or is it a simulation of higher processes with known results?

For one thing, if the protein interactions are abstracted and what's chewing things up is the modeling of them, then it would make more sense to use a lot of powerful GPUs; then you could spread the calculations out over a bunch of Xboxes.

Secondly, despite the huge amount of data generated, you should be able to monte-carlo outputs for inputs on a thing like this. In essence, use it to generate a rainbow table of states, inputs and outputs so you can then run a multi-cell organism much more quickly. Assuming every cell has the same DNA.

mtinkerhess · on July 21, 2012

In designing their model, the scientists chose an approach called object-oriented programming, which parallels the design of modern software systems. Software designers organize their programs in modules, which communicate with one another by passing data and instructions back and forth.

Similarly, the simulated bacterium is a series of modules that mimic the various functions of the cell.

Pretty cool that a relatively mass-market outlet like the Times thought it was worth mentioning OOP. Even as a software developer, it's a stretch for me to visualize what it means to simulate an organism; what a difficult job for the Times to distill it down into a couple of paragraphs for a lay audience.

paulsutter · on July 20, 2012

“The model presented by the authors ... should be commended for its audacity alone"

Audacity, yes we do need more of that. pg says that tenacity if the single most important trait of an entrepreneur. Maybe audacity is a multiplier that brings us the big innovations.

dllthomas · on July 20, 2012

audacity + tenacity => high variance

lectrick · on July 20, 2012

It looks like they have modeled the subprocesses of the cell and run a simulation of an assembly of those, instead of actually running (for example) a physics simulation of the entire organism at the atomic level. FYI

reitzensteinm · on July 20, 2012

There are 7 * 10^9 carbon atoms in E Coli [1], so we're probably a decade or two away from being able to do any kind of simulation at the atomic level.

Though it's probably getting close to the point where getting the right algorithms is the bottleneck; if you could somehow bring the techniques and optimizations we'll inevitably learn over the next 100 years back to today, a half decent simulation would probably already be possible on a cluster of commodity hardware.

Unless the techniques revolve around a new method of computation, of course - maybe memristor logic helps with that kind of thing.

[1] http://bionumbers.hms.harvard.edu/bionumber.aspx?s=y&id=...

bugsbunnyak · on July 20, 2012

Atomic-level would be overkill for a lot of cellular processes. It's certainly crucial for protein interactions (for example), but affinity/dissociation constants are well-characterized so modelling individual binding site interaction isn't going to make a difference at the (relatively) macro level.

SilasX · on July 20, 2012

It says it was a bacterium. So in that case, single cell = organism.

dailo10 · on July 20, 2012

While reading this, the Matrix crossed my mind:

What if we're just living in a giant simulation? What level of computing power would it take to run the planet Earth?

usefulcat · on July 20, 2012

“The major modeling insight we had a few years ago was to break up the functionality of the cell into subgroups which we could model individually, each with its own mathematics, and then to integrate these sub-models together into a whole,” Dr. Covert said. “It turned out to be a very exciting idea.”

I gotta say, if I was going to be working on that code, that statement would make me rather uncomfortable.

technotony · on July 20, 2012

This is a great example of a field which will be massively impacted by Moore's law. In 10 years time you will have the equivalent processing power in just one computer, meaning hobbiests are going to be able to run these kinds of experiments. Then things are going to get really exciting!!

sgoranson · on July 20, 2012

So if we ran this simulation for 4 billion years would it pass the Turing test?

praptak · on July 20, 2012

There is a patent on that: http://www.google.com/patents/US5621671 "Digital simulation of organismal growth." Draw your own conclusions.

nosse · on July 21, 2012

It's only for Drosophilia embryo. https://en.wikipedia.org/wiki/Drosophilia

Owned by U.S. Navy, issued 1987 so it's free now.

chatmasta · on July 20, 2012

Who else thought this said orgasm?

daveman · on July 20, 2012

Women have been able to simulate those for a while now.

splat · on July 20, 2012

Well, they did study the bacterium M. genitalium.

ebtalley · on July 20, 2012

ditto, I got half way through the article expecting discussions about neurons.

rglovejoy · on July 21, 2012

I read it as 'organization'.

dedward · on July 22, 2012

Anyone else recall a Greg Egan story along these lines? can't recall the name.

RobertKohr · on July 23, 2012

Yep, Permutation City - the human brain was scanned at super high resolution (was able to snapshot the current state of all cells). Then it was able to be run in a simulated world with simulated sensory inputs. And of course multiple copies could be run.

I am really just scratching the surface of what was covered, but it was a great book and I would recommend it for anyone interested in simulated life.

martindale · on July 21, 2012

Does it emulate mitosis? Can we artificially emulate mutations?

ef4 · on July 20, 2012

Basic biology research is really exploding. Very exciting stuff.

tpowell · on July 20, 2012

We need Digg back for comments on stories like this...