IMHO, Google released TensorFlow because AI is currently being driven by research, and researchers were mostly writing code for Torch, which is used at Facebook. So FB folks were enjoying lots of new algorithms, benchmarked against their systems. Google open-sourcing TF shows that the benefits of open-sourcing it outweigh the advantages of keeping it closed. Even if the future is data and not code, companies will not open-source their code unless they stand to benefit.
Yes, I think at Google's scale one of the biggest pain points is almost certainly integrating new employees to do things "the Google way". Instead of picking up qualified people who may or may not buy in to what you're doing internally, why not pick from a group of candidate employees who have already bought in and have self-taught, at zero cost, how to do things "the Google way".
Also, as far as I understand, TensorFlow is not technically a set of proprietary algorithms. It's basically a framework for ML.
> I think at Google's scale one of the biggest pain points is almost certainly integrating new employees to do things "the Google way".
Ramping up at Google is definitely a stressful process, but I don't think open sourcing technology puts much of a dent in that. Even if you show up your first day at work knowing every tool Google uses, on your second day there will be some new tool and something else will be deprecated. A couple of years in, damn near every piece of software you were familiar with will have been replaced by something different. You are in a constant state of learning.
This is, I think, one of the reasons Google places such a premium on algorithms and data structures in hiring. They are some of the few things that don't change often and being familiar with fundamental concepts makes it much easier to quickly pick up a new tool that uses them.
Thanks for the insights. I can definitely see what you're saying from the perspective of a software engineer.
I wonder if those same concepts translate over to onboarding people who might have a weaker programming background, such as mathematicians / theorists who may have a more difficult time making the switch. For example if they've been using the same toolset their entire careers.
This last point isn't in response to your comment, but more a response to the ideas presented in the article. I don't necessarily buy the idea that the data is incentive enough to researchers at the top of their fields to leave what they're doing to go work at Google. Surely they have access to plenty of public datasets large enough to accomplish what they want to accomplish. So the requirement for switching tools may be a much bigger hurdle when trying to recruit for ML. Maybe I'm wrong in assuming they are targets for employment at Google.
Your understanding is correct. TensorFlow is a library for defining and executing mathematical operations on multidimensional arrays. It just so happens that much of machine learning, and especially deep learning, can be described as mathematical operations on multidimensional arrays.
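To make that concrete, here is a toy numpy sketch (not TensorFlow's actual API, just an illustration) of the kind of multidimensional-array computation being described: a single neural-network layer is nothing more than a matrix multiply, a bias add, and a nonlinearity.

```python
import numpy as np

# A toy "layer": matrix multiply + bias + nonlinearity -- the kind of
# multidimensional-array operation a framework like TensorFlow expresses
# as a graph of mathematical ops. (Illustrative values, chosen by hand.)
x = np.array([[1.0, 2.0, 3.0]])   # 1x3 input batch
W = np.ones((3, 2)) * 0.5         # 3x2 weight matrix
b = np.array([0.1, -0.1])         # bias vector

logits = x @ W + b                # affine transform
output = np.maximum(logits, 0.0)  # ReLU activation
print(output)                     # [[3.1 2.9]]
```

Deep learning just stacks many such operations, which is why a general array-math framework covers so much of it.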
Google has a very good onboarding process (I worked there as a contractor in 2013). They have classes where instructors teach you how to use the infrastructure, and codelabs that you can do at home or at work to learn more specialized things.
So, I think that they open sourced TensorFlow more as an advertisement/inducement to hire more ML people.
Your last point is right too: Google built a massive infrastructure because it had Google File System, BigTable and MapReduce. Hadoop, inspired by the published papers, turned into more sophisticated software (an ex-Google engineer said that, though other Googlers deny his claims) and grew into a big ecosystem of products and services. But while Hadoop was matching the feature set of Google's proprietary implementations, Google was getting even bigger.
Yes, data is important to AI but open-sourcing TensorFlow doesn't mean that code is not. Rather, it means that data and code have different strategic value. Data is their secret sauce and code is their network. The more people use and contribute to the library, the better the code gets.
Also important to note TensorFlow is probably not the complete package of what they use at Google.
Aren't we getting ahead of ourselves here? It's just been released; we don't even have favorable benchmarks yet. Sure, it's by Google and lots of big names are associated with its development, but it still has to be adopted by the community and proven to be faster/easier/more flexible than Torch/Theano.
Why is every other article playing the anti-privacy game these days? If anything, the discussion in the article should read as pro-privacy. If there is so much value in private information, then beyond the ethical motivation there might also be a financial motivation to safeguard our data. Even more so in enterprise environments, as the boundaries between personal and business devices get blurrier. This is also why I think enterprise will move on from BYOD to MYOD (make your own device).
The article refers to Apple taking a more "extreme" stance on privacy and being at a disadvantage for doing so. While I appreciate that the article is about data in AI, it neglects to even entertain the idea that privacy may be more important in the long term than whatever added benefit comes from exploiting the personal information entrusted to them.
It is just a reference to a return of the systems perimeter within the control of the enterprise. As monitoring and data collection functionality becomes more prevalent, embedded and hard to identify even on traditional platforms (e.g. Windows), I think it makes sense for enterprises to move to platforms under their control (e.g. custom Android, linux, even custom hardware etc.) Most probably that would be through a market that builds most of that for them, with sufficient guarantees/standards.
That would be nice, but is there a hardware supply chain that can support enterprises? Dell & HP seem to have given up on making money from endpoints. OEMs like Quanta and Foxconn would need middlemen to support customers. Secure OS software like Qubes struggles to find OEMs who care about integration.
Not a word in the article about Wikipedia, which is the source of so much learning material by all players. Not to mention WikiData, which Google supports. It's really a shame how concrete signs of a thriving commons, a really exciting thing in the world and as important as the internet, are ignored by surface level press.
It's been pretty clear for a while now that data has enormous financial and strategic value. But that doesn't mean code _isn't_ the future. You really don't get much value out of one without the other, at least in an age where most of our data is not easily understood.
The biggest issue with our learning algorithms is that they are incredibly complicated and require high levels of mathematical understanding. The number of people driving forward machine-learning is small simply because it is such a difficult subject. There are many more people aggregating large and interesting collections of data. I think by releasing TensorFlow Google is encouraging data-collection built around their software; making it easier for a majority of people to benefit from machine learning while ensuring the continuation of their own product, code, data-collection, and ecosystem.
This isn't news to Wired, is it? I thought Wired knew what Andrew Ng said: "I think AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If you have a large engine and a tiny amount of fuel, you won't make it to orbit. If you have a tiny engine and a ton of fuel, you can't even lift off. To build a rocket you need a huge engine and a lot of fuel.
The analogy to deep learning [one of the key processes in creating artificial intelligence] is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms." http://www.wired.com/brandlab/2015/05/andrew-ng-deep-learnin...
I think the data can be, and should be, crowdsourced, like Wikipedia or OpenStreetMap are. However, I have no idea how to do that. How do you take two learned neural networks (or whatever) for a specific application (like image recognition) and merge them efficiently? I think that needs to be figured out first.
I don't know much about actually merging networks. I knew a guy who trained multiple networks in parallel for his M.S. and then combined them by just averaging the corresponding weights. It seemed to work well. For this to work, the networks needed to have the same architecture (same number of layers, nodes in each layer and same connectivity between layers).
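The averaging approach he used can be sketched in a few lines. This is a hypothetical illustration with hand-picked numbers; `net_a` and `net_b` are assumed to be lists of per-layer weight arrays from two identically-architected networks (real frameworks store weights in their own containers).

```python
import numpy as np

# Two trained networks with the SAME architecture, represented here as
# lists of per-layer weight arrays: [layer1_weights, layer1_biases].
net_a = [np.array([[0.2, 0.8], [0.4, 0.6]]), np.array([0.1, 0.3])]
net_b = [np.array([[0.6, 0.4], [0.0, 1.0]]), np.array([0.5, 0.1])]

# Merge by averaging corresponding weights, layer by layer.
merged = [(wa + wb) / 2.0 for wa, wb in zip(net_a, net_b)]
print(merged[0])  # [[0.4 0.6]
                  #  [0.2 0.8]]
```

Note that this only makes sense when the layer shapes and connectivity match exactly, as the parent comment says; averaging weights of structurally different networks is meaningless.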
Now if you just want to train multiple neural networks (or other classifiers) on different datasets (to give them different strengths), you can keep them separate and build a composite system that lets each network "vote" on an answer to a given problem; the decision of the overall system is a weighted sum of the components. See [0].
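A minimal sketch of that voting scheme, with made-up models and weights (in practice the per-model weights would come from something like validation accuracy):

```python
import numpy as np

# Each "model" stands in for a separately trained classifier and
# returns class probabilities for an input. (Toy stand-ins here.)
def model_a(x):
    return np.array([0.9, 0.1])   # confident in class 0

def model_b(x):
    return np.array([0.4, 0.6])   # leans toward class 1

weights = [0.7, 0.3]              # assumed trust in each model

def ensemble(x):
    # The composite decision is a weighted sum of the members' votes.
    votes = weights[0] * model_a(x) + weights[1] * model_b(x)
    return int(np.argmax(votes))

print(ensemble(None))  # 0  (0.75 for class 0 vs 0.25 for class 1)
```

Unlike weight averaging, this composite approach doesn't require the member networks to share an architecture at all.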
Data? Code? I vote for engineering and, there, sometimes applied math.
E.g., once some colleagues and I gave a paper at an AAAI IAAI conference at Stanford. All the good work was just engineering. For our work, which was basically just some code, I later found and published some applied math that did much better.
This article is spot on. 'In the Plex' constantly referred to Google's search being strong because it had such rich search data, not because of predictive algorithms. They used the experience of what other people searched to determine what was best to show.
Here's a provocative idea: Maybe it's because deep learning and all the other popular AI algorithms are complete and utter rubbish.
Maybe using them has nothing to do with standing on the shoulders of giants but much more with standing on the shoulders of the local maximum that is achievable by throwing insane amounts of data at dumb algorithms.
If you could choose between access to algorithms and data structures that exactly mimic the human brain and a data set that contains everything all humans taken together know, what would you choose?
Trained networks of artificial neurons are function approximators, so they are basically algorithms; we just don't care enough to find their analytical expressions. An analytical expression is appealing, but some problems may prove irreducible or only slightly reducible. I don't see a problem with either approach, since they both achieve the same effect.
They're not rubbish; they have their uses. You kind of outlined where deep learning tends to shine: when you have large amounts of data and massive CPU infrastructure. Pattern recognition and machine learning work, but typically require more manual overhead than throwing tons of data at "dumb" algorithms to achieve the same output.
In the end, you need both the algorithms and the data to do the work, and choosing between the two leaves you still wanting the other.
Of course, and I don't seriously claim that they are rubbish. They are useful and I admire some of the people who have developed them.
But I think we need to question why data seems to have this outsized value compared to algorithms. I don't think it is some sort of information theoretical invariant. It's a relationship between the specific algorithms and the specific sort of data we have.
I beg to differ. Assuming we want to make machines intelligent by mimicking humans, more focus should go into modeling the brain, which means algorithms. A human (perhaps even a baby) can see one image (and its context?) and identify anything similar. It's not processing terabytes of similar images to "learn" about the object.
Although some very basic abilities are innate in the brain, it's through processing zillions of bytes of data throughout development that humans acquire most of their skills. Plasticity basically boils down to network training, although we still know far less about the former than the latter.
It's quite likely we'll have intelligent agents before we can model the brain (even better, we will let them do that for us).
I thought TensorFlow was designed to run mostly on Google's infrastructure, although I have seen a post about someone trying to get the CUDA code working on an Amazon GPU instance. Am I mistaken?
I haven't tried it out yet but one of their 'selling' points is portability. From the current front page of tensorflow.org:
"TensorFlow runs on CPUs or GPUs, and on desktop, server, or mobile computing platforms. Want to play around with a machine learning idea on your laptop without need of any special hardware? TensorFlow has you covered. Ready to scale-up and train that model faster on GPUs with no code changes? TensorFlow has you covered. Want to deploy that trained model on mobile as part of your product? TensorFlow has you covered. Changed your mind and want to run the model as a service in the cloud? Containerize with Docker and TensorFlow just works."
Machines by themselves don't and will never understand data. What we feed into the machine must be carefully cleaned, it won't work on the first run, and you need at least an idea of why it isn't working...
Even if the software is trivial, and it's not, a lot of specialized, highly skilled work is still necessary to make the whole AI deal work...
Not to mention that to collect data you need well-crafted software...
>Machines by themselves don't and will never understand data.
The apocryphal "nobody will ever use more than 640kb" should have taught us to never say never. Of course machines will understand data. Dirty data, full of errors, unfiltered and non curated. Just like we do.
Until they learn to think by themselves and release Skynet upon us....
Jokes aside, it is perfectly natural that if we ever manage to understand how biological computers work, we might be able to make machines think just like us.
It doesn't need to be now, it can take a few hundred years more, assuming we don't destroy ourselves until then.
But you need guidance throughout your life to understand things and build knowledge. You can have a computer as sophisticated as a human brain, but I am sure that in the history of evolution our human ancestors were not told by a deer how to start a fire or make clothes. I think they slowly built up that knowledge and passed it on to the next generation. You are right that some of the brain is encoded in DNA.