The substance of the article is quite interesting, but the headline and premise--that it was a computer program and not humans who found the result--are ridiculous. The computer program did not collate and index the raw data and notes. The computer program did not choose the relevant inputs from the sum of all knowledge. And most importantly, the computer program did not write itself.
Software is a tool that humans create and use, not an entity in itself. Even if you think true AI is near at hand, this article describes nothing of the sort.
Houses are far easier to build with saws, hammers, and nails than by manipulating wood, earth, and metal with our bare hands, but that does not mean the tools built the house.
I agree that this wasn't done by the computer (did computers uncover the Higgs boson?), but I also don't believe humans can take most of the credit: this was the result of a man-machine team-up, and trying to disentangle credit assignment is not a worthwhile activity. Roughly, and from a quick reading of a paper thickly frosted with jargon I am unfamiliar with, the method works by searching for stable clusters in a reduced-dimensionality space of the variables and building networks from them, which highlight key relationships for visualization.
Humans are there to explore the visualizations, interpret the network structures, and understand the clusters and variables. The machines are intelligent too; they do the heavy work of comparing large numbers of points in a high-dimensional space, factorizing, and searching for a way to express the data in a manner that makes it easier to uncover promising research directions and hypotheses.
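For the curious, here is a rough sketch of what that "stable clusters in a reduced dimensionality space" step might look like in code. This is my own toy illustration of the idea, not the paper's actual pipeline; scikit-learn, the PCA/DBSCAN choices, and the parameter grid are all arbitrary assumptions.

    # Toy illustration only -- not the paper's method. Assumes scikit-learn;
    # the choice of PCA + DBSCAN and the eps grid are arbitrary stand-ins.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import DBSCAN

    def stable_cluster_counts(data, n_components=2, eps_grid=(0.3, 0.5, 0.7)):
        """Reduce dimensionality, then check which cluster structure
        persists across a range of clustering scales."""
        reduced = PCA(n_components=n_components).fit_transform(data)
        counts = []
        for eps in eps_grid:
            labels = DBSCAN(eps=eps, min_samples=5).fit_predict(reduced)
            counts.append(len(set(labels) - {-1}))  # -1 is DBSCAN's noise label
        return counts  # counts that repeat across scales suggest "stable" clusters

Clusters that survive across scales would then become the nodes of the network the humans go on to explore.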
Scanning this, it seems the most valuable contributions are their network visualization and exploratory tools. I think they should be proud of those and see no need to stretch so mightily to connect this to stronger AI. As Vinge notes, "I am suggesting that we recognize that in network and interface research there is something as profound (and potentially wild) as Artificial Intelligence."
>I agree that this wasn't done by the computer (did computers uncover the Higgs Boson?) but I also do not believe humans can take most of the credit: this was the result of a Man Machine System team up
You realize that they're using software made by a team of mathematicians and software developers, right? If you want to give credit to the software, give credit to the people who wrote the code and discovered the mathematics. This isn't any different than how physicists would use Mathematica.
I think the second big take away is that this was only made possible because the scientists willingly shared their "dark data"-- data and lab notes from failed experiments. I wonder how much data is hoarded privately and never opened up and analyzed like this.
Frequently a lot of this hoarded data is flawed or defective due to improper setup or execution of the experiment. That isn't to say the information in this "dark data" is useless, but it needs to be taken in context. The cleanest data with the best results are put forward into a paper; the chaff is not.
It is also possible that the data is not understood and therefore thought to be flawed. The article talks about how the black-box approach removes human bias from the initial findings.
Not if your dark data is "I forgot to autoclave an instrument and contaminated my samples." In that case without an unexpected positive result it's just error and not worth reporting.
That's simply not true. One of the most common ways, if not the most common way, to fail an experiment is through contamination, and there are at least a dozen different types of bacteria in the average lab that are brutally efficient at outcompeting whatever is in your sample, and probably thousands more that are problematic at best. Once your sample is contaminated it is useless, because the number of variables out of your control grows by several orders of magnitude in an already poorly bounded experiment.
Even if you have the best biosafety hood with proper airflow to pull things away from your samples, even the simple mistake of taking off your gloves in the hood or wafting your hands over petri dishes is enough for some skin cells carrying bacteria to wipe out an entire experiment.
They are, in millions of lab notebooks around the world that will never see the light of day, and for good reason. There are so many more experiments that end with the unqualified final note "samples contaminated" than successful ones that if biologists spent time tracking down the source or even the type of the contamination, we probably still wouldn't have modern medicine.
Is there any kind of survey of all failed experiments and the causes? What are the numbers? What percentage of experiments fails? How can we be sure the successful trials weren't random if the failures aren't reported in any way?
You seem to be conflating all possible modes of failure under the simplistic designation "failed experiment" and setting up impossible standards for record keeping. The clinical-trial equivalent of sample contamination is a patient getting hit by a bus minutes after they receive the first treatment of a trial. Sure, if it's a psychoactive drug you would investigate whether it contributed, and you can include this tiny little blip of data in the thousands of other pages you give to the FDA, but what's the point? The trial was ruined by an unpredictable act of nature, and your resources can be much better spent focusing on the other patients than investigating whether the driver was intoxicated or whether the hospital needs more stop signs, which are entirely irrelevant to whether or not your drug works.
I am by no means advocating that well thought out and executed experiments that fail to provide evidence for the experimenter's hypothesis should be locked in a dusty file cabinet forever closed to study, but those are few compared to the total number of experiments that ended due to clumsiness, sleep deprivation, or too many undergrads in the lab. Science is all just human error, through and through.
Except in this case your build system is configured in such a way that a failed build triggers bundling your /usr/bin as a release. The result is mostly the same everywhere, with slight differences per programmer, and is utterly worthless for anyone.
Depends on whether the lessons have already been learned or not.
Stuff like "we fucked up our culture, therefore our cells couldn't do whatever we wanted" is already well understood.
In clinical trials where some patients may respond to a treatment and others not, there's definitely a lot more to learn there, if you have a large enough data set and a plurality of controls.
Wrong data is much, much worse than no data, since it may lead you down the wrong path. Think what false news of a Russian nuclear strike on US soil would have done during the Cold War.
A recent interesting commentary in Nature suggests researchers should "blind" themselves to their data and instead analyse a similar but altered data set. When they are happy with their analysis, the steps are then applied to the real data. The aim is to prevent confirmation bias.
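As a concrete (and entirely hypothetical) illustration of that workflow: tune the analysis against a copy with the outcome scrambled, freeze it, then run the identical steps on the real data once. The column names and the trivial "pipeline" below are invented; only the blinding idea comes from the commentary.

    # Hypothetical sketch of a blind-analysis workflow; assumes numpy/pandas.
    # The "treatment"/"recovery_score" columns and run_pipeline() are made up.
    import numpy as np
    import pandas as pd

    def make_blinded_copy(df, outcome_col, seed=0):
        """Copy the data with the outcome column shuffled, so the analysis
        can be tuned without peeking at the real effect."""
        rng = np.random.default_rng(seed)
        blinded = df.copy()
        blinded[outcome_col] = rng.permutation(blinded[outcome_col].to_numpy())
        return blinded

    def run_pipeline(df, outcome_col):
        """Stand-in for whatever analysis steps eventually get frozen."""
        return df.groupby("treatment")[outcome_col].mean()

    # 1. Develop and freeze the analysis on the blinded copy:
    #    run_pipeline(make_blinded_copy(data, "recovery_score"), "recovery_score")
    # 2. Apply the identical, frozen steps to the real data exactly once:
    #    run_pipeline(data, "recovery_score")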
I'm, unfortunately, very familiar with the LPU (Least Publishable Unit) after having been to grad school. It's actually a thing, especially among pre-tenure people.
Example from the programming world: GitHub profiles are used as a hiring tool, so programmers start dumping thousands of undocumented, useless "projects" into GitHub.
Occasionally, I wonder how much more advanced our world would be if scientific and government data were simply available as a default choice. Then I smile at my own optimistic viewpoint and come back to reality.
Coming from medicine, I think the title of "medical breakthrough" is too generous. It's a great proof of concept, but all this says is that in rats, high BP in thoracic spinal cord injuries was associated with worse outcomes. I'd like to see a follow-up on human data from perioperative BP recordings next. If it still holds true, then you can research whether an intervention in BP control makes a difference. I'm not a neurosurgeon, but I'm sure the correlation between BP and SCI outcomes has been looked at before.
"The process was outlined in a paper published today in Nature, and hints at the possibility of medical breakthroughs lurking in the data of failed experiments."
If there were some way to reliably make sense of data from negative-results experiments, it would be absolutely revolutionary and would certainly turn our ideas about what constitutes a successful experiment on their head. I am very hopeful for fruitful results from the methods outlined in this article.
I worry about the usage of "failed" experimental data here, though. I've "failed" a lot of experiments for reasons other than not finding the effect I was looking for in my data. Any exploration of data from negative-results experiments needs to be taken very narrowly, with a deep understanding of exactly what effect is being examined.
Experiments are frequently designed incorrectly for studying the effect they want, and are almost always not suitably controlled for examining non-primary effects. Try to find a trend throughout non-primary effects over a large swath of experiments, and I'm sure you will-- but it may be noise.
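To put a rough number on that "it may be noise" point, here is a quick simulation of my own (the experiment and sample counts are arbitrary): correlate completely unrelated variables across a couple hundred "experiments" and count how many clear p < 0.05 anyway.

    # Toy simulation with made-up sizes; assumes numpy and scipy are available.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_experiments, n_samples = 200, 30
    false_trends = 0
    for _ in range(n_experiments):
        x = rng.normal(size=n_samples)  # unrelated "non-primary" variable
        y = rng.normal(size=n_samples)  # unrelated outcome
        r, p = stats.pearsonr(x, y)
        if p < 0.05:
            false_trends += 1
    print(f"{false_trends}/{n_experiments} random pairs show a 'significant' trend")

You should expect roughly 5% of the random pairs to come out "significant", which is exactly the kind of spurious trend a broad sweep over non-primary effects will surface.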
> Any exploration of data from negative-results experiments needs to be taken very narrowly, with a deep understanding of exactly what effect is being examined.
I disagree. There's value in data mining previous experiments, just not conclusive value. As long as the results of such data mining are limited to generating new hypotheses (which are then tested by experiments explicitly designed to do so), I think this methodology can have great value. In this particular case, I don't think it's a surprise that perioperative hypertension is associated with worse outcomes, but the hypothesis that controlling BP with medication before surgery and on through recovery might produce better outcomes is worth investigating.
What kinds of infrastructure/tech do you think will have the most utility for topological data analysis in the near future? E.g., GPUs, Apache Spark, FPGAs, etc.
Any thoughts on an Ayasdi public offering? I'd like to consider investing but I don't have millions of dollars (yet) :) .
A slightly more in-depth blog post: https://shapeofdata.wordpress.com/2013/08/27/mapper-and-the-choice-of-scale/
A very accessible book about topology (especially from an algorithms perspective): http://www.amazon.com/Computing-Cambridge-Monographs-Computational-Mathematics/dp/0521136091/ref=sr_1_1?ie=UTF8&qid=1444971634&sr=8-1&keywords=topology+for+computing
Blog post explaining persistent homology: https://normaldeviate.wordpress.com/2012/07/01/topological-data-analysis/
Video explaining persistent homology:
https://www.youtube.com/watch?v=CKfUzmznd9g
Some free software:
Python Mapper by Daniel Müllner: http://danifold.net/mapper/index.html
JPlex library by Harlan Sexton: http://www.math.colostate.edu/~adams/jplex/index.html
Dionysus by Dimitriy Morozov: http://www.mrzv.org/software/dionysus/
Topological Data Analysis in R: https://cran.r-project.org/web/packages/TDA/vignettes/article.pdf
Infrastructure
Our tech stack is:
Backend:
HDFS for storage
Our ML and math code is hand-rolled C++ and Assembly (7% of LOC)
All coordination/distributed systems code is in Java
ZMQ for communication
Protocol Buffers for the wire protocol
Frontend:
D3
Backbone
Hand-rolled WebGL graph visualization (we open-sourced it at https://github.com/ayasdi/grapher)
We currently don't use GPUs or any other fancy hardware primarily because today, our customers use commodity hardware and getting F1000 companies to buy cutting-edge hardware is just plain horrible.
We have an awesome GPU rig at our offices that we test algorithms on, and it can really make our algorithms scream, but again, none of our customers have GPUs or are willing to invest in them.
Apache Spark: interestingly, in our experience, making it work for ML algorithms is too much work unless you invest the time to understand the framework and its fundamentals. It performs very well for ETL-type tasks, which is what we use it for.
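To give a flavor of what we mean by ETL-type tasks, here is a minimal PySpark sketch (illustrative only; the HDFS paths, schema, and column names are invented, not our actual pipeline).

    # Illustrative PySpark ETL sketch; paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    raw = spark.read.csv("hdfs:///data/experiments/raw/*.csv",
                         header=True, inferSchema=True)

    cleaned = (
        raw.dropna(subset=["subject_id", "measurement"])           # drop incomplete rows
           .withColumn("measurement", F.col("measurement").cast("double"))
           .groupBy("subject_id")
           .agg(F.avg("measurement").alias("mean_measurement"))    # simple aggregate
    )

    cleaned.write.mode("overwrite").parquet("hdfs:///data/experiments/clean/")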
On a public offering: no comment :)
If you have more questions - I am easy to find :)
I'd love to read a couple of journal articles that you recommend to learn about TDA. I do large scale data analysis on health care data at my university and am always on the look-out for interesting techniques.
Hidden-relationship mining has taken a few different paths, from TDA to LDA graphical modelling (Michael Jordan, David Blei) to vector-space-driven approaches (Berkeley Lab). Extracting hidden relationships in datasets and using these to form new hypotheses and enable or even make new discoveries is certainly the future...
Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span - Blei DM, Franks K, Jordan MI, Mian IS. - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533868
One immediate question is whether this might be a result of overfitting -- when enough hypotheses are tested, some will surely be confirmed at any given significance level. Still quite interesting of course; now what is needed is a follow-up study on other datasets (preferably human), or an experiment to confirm.
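The arithmetic behind that worry is simple, assuming independent tests at a 5% significance level:

    # Family-wise error rate for N independent tests at alpha = 0.05,
    # plus the standard Bonferroni-corrected per-test threshold.
    alpha = 0.05
    for n in (1, 10, 50, 100):
        p_any_false_positive = 1 - (1 - alpha) ** n
        print(n, round(p_any_false_positive, 3), alpha / n)
    # roughly: 10 tests -> ~40% chance of at least one false positive,
    # 50 tests -> ~92%, 100 tests -> ~99%; Bonferroni shrinks the threshold
    # to alpha/N to compensate.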
A better (but less exciting) title would be "A computer program suggests a promising avenue of research".
Things like this are why I feel something like Google DeepMind could be a game changer for the sciences if all human research data were available to it.
It might never reach the point of true AI, but it would still beat all humans at finding relationships in data that humans can't even remember.
This is the breakthrough:
In the case of the spinal cord injury data, Ayasdi’s TDA-driven approach mostly confirmed what researchers already knew: The drugs didn’t work. But the discovery of high blood pressure’s detrimental effects on long-term recovery has immediate implications for human patients, namely whether the use of hypertension drugs immediately after their injuries and before surgery could improve outcomes,
From the article: "In the case of the spinal cord injury data, Ayasdi’s TDA-driven approach mostly confirmed what researchers already knew: The drugs didn’t work."
How this "Ayasdi" company's analysis probably works (based on "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival" and the original "Mapper" paper "Topological Methods for the Analysis of High Dimensional
Data Sets and 3D Object Recognition"): They take point cloud data and connect each point with its neighbors (the distance metric that is used is probably domain-specific) to build a proximity graph that approximates a simplicial complex. As input to their algorithm, they also have one or more scalar functions defined on the point cloud data that contain information which is relative to the problem at hand. For example, each point could be a gene, and maybe the scalar function value at that gene could be probability of association with some disease, and the distance between two genes might be the Levenshtein distance between their genetic codes.
With data in this form, they approximate the Reeb graph of one of the scalar functions, which is a sort of "data skeleton." They can do potentially interesting/useful things with it.
The approximation of the Reeb graph reveals zero-cycles (connected components of the simplicial complex) and some one-cycles (handles/tunnels in the graph, sort of like holes in a donut). This "skeleton" of the data allows them to do a variety of things, such as segment the data into components that are (approximately) topologically "simple" (they do not contain any 1-cycles), identify local maxima/minima, find saddle points where forks in the data merge together, and locate "essential saddles" which constitute the high points and low points of handles/tunnels. They can also remove "topological noise", which helps them to separate spurious topological thingies from features that might be important.
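To make the single-scalar-function version of this concrete, here is a stripped-down Mapper-style sketch. It is my own reading of the Mapper idea, not Ayasdi's code; the DBSCAN clusterer and all parameters are arbitrary choices. Cover the range of the scalar function with overlapping intervals, cluster the points in each interval, and connect clusters that share points.

    # Stripped-down, illustrative Mapper-style construction (single filter).
    # Not Ayasdi's implementation; assumes numpy, networkx, and scikit-learn.
    import numpy as np
    import networkx as nx
    from sklearn.cluster import DBSCAN

    def mapper_graph(points, filter_values, n_bins=10, overlap=0.5, eps=0.5):
        """Cover the filter range with overlapping intervals, cluster the
        points falling in each interval, and connect clusters that share
        points. The result is a graph-shaped 'skeleton' of the data."""
        lo, hi = filter_values.min(), filter_values.max()
        width = (hi - lo) / n_bins
        step = width * (1 - overlap)
        graph = nx.Graph()
        nodes = []  # list of (node_id, set of point indices)
        for start in np.arange(lo, hi, step):
            idx = np.where((filter_values >= start) &
                           (filter_values <= start + width))[0]
            if len(idx) == 0:
                continue
            labels = DBSCAN(eps=eps, min_samples=2).fit_predict(points[idx])
            for lbl in set(labels) - {-1}:          # -1 is DBSCAN's noise label
                members = set(idx[labels == lbl])
                node_id = len(nodes)
                graph.add_node(node_id, size=len(members))
                nodes.append((node_id, members))
        # Overlapping intervals give shared points; shared points give edges.
        for i, (ni, mi) in enumerate(nodes):
            for nj, mj in nodes[i + 1:]:
                if mi & mj:
                    graph.add_edge(ni, nj)
        return graph

Connected components of the resulting graph correspond to the zero-cycles mentioned above, and loops in it correspond to (some of) the one-cycles.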
Their technique doesn't necessarily recover "true" topological information since a lot of what they do is approximate. There are actually more accurate techniques (e.g., simplicial homology, or fast Reeb graph algorithms) for getting an exact answer, albeit with potentially higher computational cost.
Topological data analysis is a big field, and this Ayasdi company appears to mainly use this one approach (but I could be wrong). I think they are trying to lay claim to the term "topological data analysis" and get people with money excited about it.
One quick edit to this description: we (Ayasdi) have generalized the notion of Reeb graphs such that it is no longer limited to single scalar functions. While in the single-scalar-function case the Mapper algorithm is an (extremely efficient) approximation of the Reeb graph, in the multiple-scalar-function case it has no direct theoretical analogue (although the notion of Reeb spaces is similar).
We are generally not trying to lay claim to the phrase "Topological Data Analysis" and are not going around suing people for using it. In fact, we still support research in academia and actively publish in the field. TDA is the basis of what we do, so it is the most efficient way of describing it.