Machine Learning for Hackers (oreilly.com)
182 points by ahalan on Feb 8, 2012 | 43 comments


I hate to be highbrow here, but I'm just waiting for O'Reilly to release Brain Surgery for Hackers. Some things are just better learned the hard way: sitting down and getting a more thorough introduction.


Maybe you're right, but I think there's something to be said for teaching applied ML separately from the theory. I took a very in-depth, theory-focused ML class at Caltech, in which the lectures made up a thoroughly rigorous mathematical introduction and the problem sets were only maybe 10% applications. My classmates and I came out of the class feeling super comfortable with the theory behind learning in general and how it applied to many of the standard ML algorithms, but without any experience actually working with very large data sets or building ML algorithm implementations that scale. That's what I'm hoping this book will help with, and I think it is appropriate to separate that kind of material from the theory.


Good point. Does one really need to know how to code up a neural network in order to just use it? This has been my biggest frustration with some teachers of machine learning: they need to abstract away the innards and provide a usable high-level interface. I know this is hard to do. ML isn't magic ... you have to know what you are doing. But the same could be said of the automobile when it was first invented. Today, to drive a car effectively, you don't need to know anything about engines.


I think these books are not really meant to be a PhD education but more of a tutorial introduction to the field. What O'Reilly is doing right now is really important. As Steve Yegge said, they are "trying to provoke a culture change."

Would-be brogrammers will find these books, read them, and graduate from cat-picture projects to more sophisticated applications that solve real computer science problems. It matters less whether the readers of these books come away with a comprehensive understanding of the underlying theory than that the books enrich readers and spread the desire to learn and attain true understanding of the field.

Steve Yegge reference: http://www.youtube.com/watch?v=vKmQW_Nkfk8


You underestimate the value of ML for monetizing cat pictures.


Simple machine learning isn't particularly magical; many of the simpler techniques are pretty trivial and can give perfectly usable results without much of a theoretical background. Sure, you can get better results faster if you dig into the details and really learn and understand what you are doing, but many real-world problems don't require that. No need to wrap simple, usable algorithms in shrouds of high-brow mysticism.
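
For example, here's a minimal sketch in R of a k-nearest-neighbors classifier, using the standard class package and the built-in iris data (k and the train/test split are arbitrary choices here, not tuned):

    library(class)

    set.seed(42)
    train_idx <- sample(nrow(iris), 100)            # arbitrary 100/50 split
    pred <- knn(train = iris[train_idx, 1:4],       # four numeric features
                test  = iris[-train_idx, 1:4],
                cl    = iris$Species[train_idx],
                k     = 5)
    mean(pred == iris$Species[-train_idx])          # accuracy, usually > 0.9

A handful of lines, no theory required, and the result is genuinely usable for many problems of this shape.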


I agree you often don't need to know the guts of the algorithms, but knowing what they're doing and how to interpret their results is important, IMO. There is a lot of misuse of statistics, or at least sloppy use of statistics, in the "applied data mining" field. In particular, when interpreting the results you usually need to pay attention to what assumptions went into generating them, or else it's easy to make stronger claims than the data really warrants.


And here I totally agree. I guess the message that needs to be made clear is that doing machine learning is quite easy and really nothing special; doing something meaningful with the numbers that come out of your program is the hard part, and that's what should be focused on.


Machine Learning in Action, from Manning, uses Python instead of R: http://www.manning.com/pharrington/

PS: I attended a machine learning class at Hacker Dojo with the author; he's a bright guy. Hopefully the book will be as good.


I would love to know more about this book, but it doesn't seem to be released yet and there are no reviews. Not sure what the point of posting it is or why it is upvoted. More details would be great.


[deleted]


It's a different ebook.


I have interacted with both authors, however briefly. These are both smart guys operating outside the typical CS fields who have figured out how to apply cutting-edge computational techniques to their specialties and deliver meaningful insight. Really looking forward to reading the book.


As a corollary to this post: does anyone know what's up with the Stanford Machine Learning course? I haven't heard anything since the delay.


It should be going live soon. I got an email about the Model Thinking class, which launched a preview at http://www.coursera.org/modelthinking/lecture/preview

Hopefully that means they're getting ready for the other classes to go live as well.


A follow-up to the AI class starts soon (I remember Feb 20 but can't find confirmation):

"Due to popular demand, we are teaching a follow-up class: AI for Robotics at www.udacity.com . Also due to popular demand, we now have a programming environment, so you can develop and test software. Our goal is to teach you to program a self-driving car in 7 weeks. This is a topic very close to my heart, and I am eager to share it with you. (This class builds on the concepts in ai-class, but ai-class is NOT required)."


Promising title; it doesn't appear to have a table of contents currently.


From Safari:

		Preface
		Machine Learning for Hackers: Email
		How This Book is Organized
		Conventions Used in This Book
		Using Code Examples
		How to Contact Us
		Using R
		R for Machine Learning
		Further Reading on R
		Data Exploration
		Exploration vs. Confirmation
		What is Data?
		Inferring the Types of Columns in Your Data
		Inferring Meaning
		Numeric Summaries
		Means, Medians, and Modes
		Quantiles
		Standard Deviations and Variances
		Exploratory Data Visualization
		Visualizing the Relationships between Columns
		Classification: Spam Filtering
		This or That: Binary Classification
		Moving Gently into Conditional Probability
		Writing Our First Bayesian Spam Classifier
		Ranking: Priority Inbox
		How Do You Sort Something When You Don’t Know the Order?
		Ordering Email Messages by Priority
		Writing a Priority Inbox
		Works Cited
		Books
		Articles
		About the Authors


That looks like TOC for "Machine Learning for Email", which is just a portion of the entire "Machine Learning for Hackers" book. The full "Machine Learning for Hackers" doesn't appear to be on Safari yet.


It doesn't look as advanced as Collective Intelligence (the Python-based book which covers most of Ng's ML course), but it looks like a cool R introduction.



Using R seems a bit strange to me -- I have nothing against R (I use it regularly), but it's not exactly a "hacker" language. This seems like an ideal book for Python, but R?


I had the same feeling when I saw this. R is a great tool for statistics, very useful when you need to do in-depth statistical analysis (analysis of variance, etc.). It doesn't strike me as a good choice for a hacker's book, which makes me wonder about their reasons for using this word, hacker. Maybe they are just trying to benefit from the buzz it generates these days?


Indeed, this was my first reaction as well... it seems like a bit of a marketing gimmick to toss the word "hacker" into the title.


Why are hackers allowed to use Python but not R? R is a quite powerful Scheme-like language with an incredible math library. Python is a "teaching language", officially.


Python or a JVM language would be a better choice.


Thanks, just what I was looking for!

Unfortunately, I see here that a large part of the book seems to be spent on a statistics introduction (including R), and only one machine learning algorithm actually gets introduced. On top of that, it's one of the simplest ones (naive Bayes). I expected at least some description of support vector machines or other more advanced techniques.


Just a quick note if you're interested: this book is similar to "Programming Collective Intelligence", an absolutely fantastic 'applied beginner' ML book:

http://www.amazon.com/Programming-Collective-Intelligence-Bu...


I have an early review copy of this book and know both the authors. It's good and I highly recommend it!


This seems like a field that could pretty easily be commodified. I can imagine a service like the Google prediction API could meet the needs for this kind of tech for many companies.

So while it's certainly an interesting field, I wonder how many hackers are really going to need these skills.


The biggest problem with machine learning occurs when people subscribe to the belief that it's a black-box solution. The truth is that you can't just drag-and-drop your data into a pre-existing solution. The types of algorithms you use depend on the types of problems you're trying to solve (e.g., classification, regression, clustering). The data you collect depends on the algorithms you use.

Sure, prediction APIs could arise that give detailed use cases for each algorithm, but then there's a problem with the fringe cases: you might not know that two pieces of data are so heavily correlated that they completely shatter a conditional independence assumption, for example.
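
To make that fringe case concrete, here's a toy sketch in R (synthetic data, made-up feature names) of how a duplicated feature double-counts evidence under naive Bayes' conditional independence assumption:

    library(e1071)

    set.seed(1)
    x <- rnorm(200)
    y <- factor(x + rnorm(200, sd = 0.5) > 0)
    d <- data.frame(y = y, x1 = x, x2 = x)    # x2 is a perfect copy of x1

    m1 <- naiveBayes(y ~ x1,      data = d)
    m2 <- naiveBayes(y ~ x1 + x2, data = d)
    # m2's posteriors are systematically more extreme than m1's, because the
    # model treats x1 and x2 as independent pieces of evidence.
    head(predict(m2, d, type = "raw"))

A drag-and-drop API won't warn you about this; you have to know the assumption exists to notice it being violated.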

As a hacker who originally subscribed to the belief that a thorough understanding of machine learning was overkill, it is without hesitation that I admit being 100% wrong. The truth of the matter is that when it's done properly, artificial intelligence and machine learning ought to be inextricably linked with your core business processes.


Agreed. I had a dataset at work that no one had yet been able to use to categorize two effects (one category was 98% of all the data). The values looked too "Gaussian normal", with everything mixed up, and it couldn't be separated out. But with a combination of SVMs and in-depth knowledge of the source of the data, I was able to find a generalized model that could accurately categorize parts 80%+ of the time for the small set, without misclassifying the other 98%. All other methodologies had failed up to that point, and a blind approach with linear regression or SVMs resulted in at best 70% accuracy across all categories: not very good or implementable in a production setting (meaning that on the bulk of the cases, the 98% class, I was only correct 70% of the time).
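
For what it's worth, here's a rough sketch of one standard trick that helps with that kind of 98/2 imbalance (this is a synthetic stand-in, not my actual data or weights): class weights in e1071's svm, so the rare class isn't swamped by the majority.

    library(e1071)

    set.seed(7)
    n <- 1000
    y <- factor(ifelse(runif(n) < 0.02, "rare", "common"))
    x <- ifelse(y == "rare", rnorm(n, mean = 2), rnorm(n))
    d <- data.frame(x, y)

    fit <- svm(y ~ x, data = d,
               class.weights = c(common = 1, rare = 49))  # ~inverse frequency
    table(predicted = predict(fit, d), actual = d$y)

The weighting alone wasn't enough in my case; the domain knowledge about the data source is what made the features separable in the first place.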


I can certainly see a role for somebody who understands the tradeoffs of each of these algorithms and how to properly select and prepare datasets. But I wonder how many people will really need to be able to actually implement these algorithms.


They're brutally simple to implement much of the time. The difficulty comes in two places:

    1. Derivation of slight variations on the basic principles 
    2. Scaling.
Both are very difficult.
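
As an illustration of "brutally simple" (a toy sketch in base R, with neither of the two hard parts above addressed), a perceptron fits in about a dozen lines:

    perceptron <- function(X, y, epochs = 25, lr = 0.1) {
      w <- rep(0, ncol(X) + 1)                  # weights plus a bias term
      for (e in seq_len(epochs)) {
        for (i in seq_len(nrow(X))) {
          pred <- sign(sum(w * c(1, X[i, ])))
          if (pred != y[i])                     # update only on mistakes
            w <- w + lr * y[i] * c(1, X[i, ])
        }
      }
      w
    }

    # Toy usage: linearly separable data with labels in {-1, +1}.
    X <- matrix(rnorm(100), ncol = 2)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
    w <- perceptron(X, y)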


That was also the whole point of the Stanford ML course: to teach exactly that skill set.

Sure, we did some basic implementations in Octave; it helps to have some idea of the internals. But that wasn't the goal of the course.


Well, no. ml-class was a series of demos. Plugging in a formula is a one-liner; developing an algorithm is another thing entirely. The original cs229 is more like what you describe.


Well, no. ml-class was a series of demos. The data was all curated in advance and the models pre-selected appropriately.


I think the regress you're talking about is super important---black box AI only goes so far---but I also think there's great benefit to just applying the first layer of broken, incorrectly paired ML to a new field.

My prediction is that even the most black-box ML, creatively applied, is and will remain an incredible skill. Increasing levels of sophistication will continually kill off the current practices of black-box ML, but the willingness to apply statistical pattern recognition to new and interesting areas can't help but be valuable.


As someone studying this intensely, I'd say it's quite the opposite. Basic ML can be (and has been) commodified with good toolkits and APIs. Additionally, much of practical ML is just the application of already-invented algorithms to fields that simply haven't seen them yet.

But that said, the deeper message is in interpretation and discovery from data: large data, small data, highly structured data, or just regular DB pulls. The heart of it is statistical pattern recognition, and it has only really begun to be broached (even academically) in the last 25 years.


I respectfully disagree. Tools like Weka, NLTK, etc. are okay for exploratory data analysis, but it's risky to use them for problems that scale, problems that differ from the norm, or homegrown solutions for data that does not yet exist. Because a large portion of HN users are interested in bringing their ideas to life, I'd suspect that the last case particularly resonates with them.

The problem facing people who intend to work with data that does not yet exist becomes one of feature selection: what data matters and how do we use it? For NLP tasks, does stemming matter? What about part-of-speech tagging? Some classification problems are not linearly separable, which makes linear methods useless without using (and knowing to use) the kernel trick.
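
As a concrete sketch of that last point (synthetic XOR-style data, using e1071): a linear kernel sits near chance while a radial kernel nails it.

    library(e1071)

    set.seed(3)
    X <- matrix(runif(400, -1, 1), ncol = 2)
    d <- data.frame(x1 = X[, 1], x2 = X[, 2],
                    y = factor(X[, 1] * X[, 2] > 0))   # XOR-like labels

    mean(predict(svm(y ~ ., d, kernel = "linear"), d) == d$y)  # near 0.5
    mean(predict(svm(y ~ ., d, kernel = "radial"), d) == d$y)  # near 1.0

You have to know why the linear model fails to know which knob to turn.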

In the end, I think my reply here is tautological: ML is too complex to be transformed into a set of APIs a la Google Maps and Google Search.


> The problem facing people who intend to work with data that does not yet exist becomes one of feature selection: what data matters and how do we use it? For NLP tasks, does stemming matter? What about part-of-speech tagging?

Indeed. I worked on machine learning in NLP (fluency ranking, parse disambiguation). As a general rule, roughly 90% of the improvement in models comes from clever feature engineering and exploiting the underlying system to get more interesting information that improves classification; the remaining 10% comes from using cleverer machine learning techniques than, say, a standard maxent learner with a Gaussian prior (for linearly separable data).

For instance, the last relatively large boosts to the accuracy of the parser developed by our research group came from feature engineering:

http://www.let.rug.nl/vannoord/papers/iwpt2007.pdf
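
And as a toy illustration of that 90/10 ratio (synthetic data; glm's logistic regression standing in for a maxent learner, to which it's equivalent in the binary, unregularized case): the same plain learner jumps from chance to good accuracy when handed one engineered feature.

    set.seed(11)
    x1 <- rnorm(500); x2 <- rnorm(500)
    y  <- rbinom(500, 1, plogis(3 * x1 * x2))   # the signal is the interaction

    raw <- glm(y ~ x1 + x2,              family = binomial)
    eng <- glm(y ~ x1 + x2 + I(x1 * x2), family = binomial)  # engineered feature
    mean((fitted(raw) > 0.5) == y)   # roughly chance
    mean((fitted(eng) > 0.5) == y)   # much better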


I think we just disagree on what "basic" ML means. I think a lot of real problems have solutions which involve very simple applications of poorly tuned ML algorithms.

Engineering even a basic ML solution is challenging---feature engineering especially.


Sorry, my mistake; I misread the comment to which I was replying as a response to my earlier comment. I completely agree with everything you've said.


Actually, the Google Prediction API is very simple, and it already covers supervised learning (regression and classification). I can imagine very simple extensions (of the API itself; the algorithms would be completely different) covering a lot of the unsupervised and semi-supervised ground as well.

The algorithms are not disclosed, but the docs hint that they are properly regularized so throwing more features at them is always good.

You still need to be able to reformulate the problems so that they fit a standard ML setting and then know how to tune things, but it looks like the API can get you pretty far.



