Machine Learning: A Summary on Feature Engineering Using R Language (microsoft.com)
137 points by unbuckledbee on March 24, 2017 | 21 comments



Overfitting can, indeed, be "fitting the data too perfectly"; e.g., Lagrange interpolation, where we have some X-Y pairs and want a polynomial in x that fits them exactly. How does that work? You can derive it for yourself: for each of the X-Y pairs, write a polynomial that is zero at all the X points except the one you want to fit, scale it to hit that pair's Y value, and then add all those polynomials. Done. Presto. Bingo. "Look, Ma, fits anything perfectly!" (as long as all the X points are distinct). "I should be able to find a polynomial that fits points from a square wave!" Yup, but graph the polynomial and see what it does away from the given X points -- the graph goes wild, shooting off toward plus or minus infinity and appearing nearly to get there.
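A minimal sketch of that construction in R (the square-wave sample points here are made up purely for illustration):

    # Samples from a square wave
    x <- seq(0, 1, length.out = 9)
    y <- ifelse(x < 0.5, 1, -1)

    # For each pair (x[i], y[i]): a polynomial that is 1 at x[i] and
    # 0 at every other x[j]; sum them, weighted by y[i].
    lagrange <- function(t) {
      sapply(t, function(tt) {
        sum(sapply(seq_along(x), function(i) {
          y[i] * prod((tt - x[-i]) / (x[i] - x[-i]))
        }))
      })
    }

    grid <- seq(0, 1, length.out = 400)
    range(lagrange(grid))  # far outside [-1, 1] between the nodes
    # plot(grid, lagrange(grid), type = "l"); points(x, y)  # watch it go wild

It passes through all nine points exactly, but between them the curve swings far outside the range of the data.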

But in linear fitting, overfitting is simpler: there, of course, you do get a linear function, which doesn't go wild, and the function can fit the data nicely. The usual problem is that the coefficients in the linear function are not unique.
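A quick base-R illustration of that non-uniqueness, with made-up data: when one predictor is an exact linear copy of another, infinitely many coefficient vectors give the same fitted values, and lm() flags the redundancy with an NA.

    set.seed(1)
    x1 <- rnorm(20)
    x2 <- 2 * x1               # exact linear copy of x1
    y  <- 3 * x1 + rnorm(20)
    coef(lm(y ~ x1 + x2))      # x2 comes back NA: the fit is rank-deficient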


What they describe is actually (very) shallow feature engineering, and these methods have been used for decades. These techniques can hardly help in really difficult data analysis tasks, where abstract features need to be either manually defined or automatically mined.


Yeah! And they use "feature engineering" where "feature selection" would suffice. The hard part of feature engineering is usually not selection, but creatively generating features based on expert knowledge, etc. Also, something like a wrapper would be overkill if you had, roughly, #data/class > #features and used a relatively overfitting-resistant classifier like an SVM, regularized regression, etc. (a more rigorous statement of this would involve the VC dimension).


What does this paragraph even mean?

"On the same note, though perfectly correlated variables are redundant and might not add value to the model (and if removed, would be computationally efficient), a high-variable correlation could have additional information to add. In other words, two variables that are not correlated could still be of importance to the model. When in doubt, it’s safer to train the model and observe it’s performance."

The "In other words" part seem to talk about uncorrelated variables while the sentence before that talked about highly correlated variables. I always thought to discard one of two variables that were highly correlated.


Yeah, I'm not sure either, that seems to be an editing mangle.

The bit about "it's safer to train the model and observe its performance", though, is reasonable advice. The main problem with multicollinearity is that it gives you a model with poorly-defined coefficients - that is, their standard errors are high. That's a big problem if you're coming at the problem with a statistician's mindset and trying to come up with a parsimonious model with statistically significant parameter estimates. If you're just going for the best predictive model you can get, though, then you don't necessarily care about super tight standard errors on all your coefficients, so it's not such a big deal.
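A small made-up example of those inflated standard errors, in base R:

    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.01)   # nearly a copy of x1
    y  <- 1 + 2 * x1 + rnorm(n)

    summary(lm(y ~ x1 + x2))  # huge standard errors on both coefficients
    summary(lm(y ~ x1))       # drop one: a tight, interpretable estimate

Predictions from the two models are nearly identical; it's only the individual coefficients that become unstable.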


Or use a factor or linear combination. Lots of ways to address multicollinearity/micronumerosity.
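For instance, a sketch of the linear-combination remedy using principal components (made-up data again):

    set.seed(2)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.1)    # highly correlated with x1
    y  <- 1 + 2 * x1 + rnorm(n)

    pcs <- prcomp(cbind(x1, x2), scale. = TRUE)$x
    summary(lm(y ~ pcs[, 1]))   # the first component carries the shared signal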


A colleague of mine is quite happy using caret for streamlined feature selection: http://topepo.github.io/caret/index.html
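For anyone curious, a minimal sketch of wrapper-style selection with caret's rfe() (recursive feature elimination), using the Sonar dataset from mlbench:

    library(caret)
    library(mlbench)       # also needs randomForest installed, for rfFuncs
    data(Sonar)            # 60 numeric features, 2 classes

    ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
    res  <- rfe(Sonar[, 1:60], Sonar$Class,
                sizes = c(5, 10, 20, 30), rfeControl = ctrl)
    res                    # cross-validated accuracy by subset size
    predictors(res)        # the variables it kept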


Is there a Python equivalent of this package?


The closest equivalent I can think of is scikit-learn [1], which also gives you a unified way of using many different algorithms. I wouldn't say the two are really equivalent but both are excellent and a joy to use. One major difference is that, as far as I know, caret is mainly a standardization wrapper for other R packages' functionality, while scikit uses its own implementations.

[1] http://scikit-learn.org/stable/


See ISLR (linked in the OP's post) for a good exploration of feature selection: http://www-bcf.usc.edu/~gareth/ISL/


This seems specific to Microsoft R. Or can I install the MicrosoftML package in regular R and/or use it under Linux? I really wonder what MS's investment in R means for the future of R.


They're just trying not to lose in the ML space, that's all. They want people to use their cloud servers.

Just because it's specific to Microsoft R and their package doesn't mean those algorithms are owned by Microsoft. You just have to google other packages that have those algorithms. All of them are just statistics... R has far more statistics packages than Python does. I would be highly skeptical if you couldn't find LDA, PCA, etc. in a package somewhere.
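For instance, stock R already covers those two examples, no Microsoft package required:

    pca <- prcomp(iris[, 1:4], scale. = TRUE)   # PCA, in base stats
    library(MASS)                               # ships with standard R
    fit <- lda(Species ~ ., data = iris)        # linear discriminant analysis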

The only important part here is the technical knowledge of what these algorithms are for, and when to use them and when not to.


MS has a history of "embracing" technologies. The arrogance is visible in the title, which doesn't mention that it's specific to MS R.

I didn't say you can't do this with standard R. I didn't say they invented or own anything.


> MS has a history of "embracing" technologies.

Are you kidding me? You're telling me they're embracing it, and not just trying to grab ML market share or playing catch-up ever since Hadoop came about?

> This seems specific to Microsoft R or can I install the MicrosoftML package in regular R and/or use it under Linux?

You asked a question and I provided an answer. Just because you didn't like it doesn't mean it's not an alternative to Revolution R. You're taking it as if it's some attack on you; grow up. I'm stating that if it isn't possible, you can always find a package for it. These are statistics algorithms that you learn in graduate stats courses, and R is a statistics language. So you can find them if you can't use the Microsoft stuff, and you don't have to worry about whether it works or not.

I also don't get the fascination with this library. You can just use packages that are agnostic to the R version (Microsoft's Revolution R or regular R).

They're trying to sell their version of R and hopefully their cloud stuff like Azure.

> The arrogance is visible in the title that doesn't mention it's specific to MS R.

I have no idea what that means.

All the mx prefix functions are Revolution R only.


I didn't downvote your comment but I don't understand what you're talking about.


A table of similar functions in R and "the other R": https://msdn.microsoft.com/en-us/microsoft-r/scaler/compare-...


Somewhat related: adding features based on trending concepts http://54.174.116.134/recommend/datasets/


That looks interesting, but without any additional details, or even other pages on the website, it's difficult to consider using it.


Seriously, it's just "use these data too, they'll help your models!" Nothing about how these features were selected or what algorithm was used.


I don't trade, so they won't help my models.

I'm interested in how they are made, though.


Glad to see young kids studying!! Hope you continue to take on more advanced projects!!!!





