For polynomial regression of the form y = p0 + p1 x + p2 x^2 + ... + pn x^n, the "training" algorithm is linear least squares (no need for gradient descent). Assuming you have data (y, x), the explicit least squares solution is P = pinv(X) Y, see:
y = [1, x, x^2, ..., x^n][p0, p1, p2, ..., pn]^T = XP
XP = y
(X^T X)P = X^T y
P = (X^T X)^-1 X^T y
(X^T X)^-1 X^T is called the pseudo-inverse of X (which contains all your data). No need for iterations. A similar closed-form solution exists for multi-variable polynomials, i.e. y = f(x1, x2, ..., xm) where f is a polynomial containing products of the independent variables x1, ..., xm and their powers.
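For concreteness, here is a minimal NumPy sketch of that closed-form fit (the data, noise level and degree are made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 200)
    y = 0.5 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

    degree = 2
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2

    # P = pinv(X) y, equivalent to (X^T X)^-1 X^T y when X has full column rank
    P = np.linalg.pinv(X) @ y
    print(P)  # roughly [0.5, 2.0, -3.0]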
This is the matrix inversion I was referring to. Its size depends (at best) on the smaller of the number of parameters and the number of training samples. Both get very big in machine learning. When that happens you need some kind of low-memory iterative method like Greville's algorithm, or even gradient descent itself, so you're ultimately not any better off.
In practice one computes (X^T X)^-1 X^T in one go using the Singular Value Decomposition, for which very efficient algorithms exist. If there is really a lot of data, recursive linear least squares can be used to partition the large least-squares problem into smaller pieces. But even then you make only one pass over the data, not multiple passes as with gradient descent.
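A rough sketch of both ideas, assuming NumPy: the SVD route applies the pseudo-inverse directly, and the chunked version makes a single pass by accumulating the normal equations block by block (a simpler stand-in for textbook recursive least squares, which updates the inverse per block instead):

    import numpy as np

    def fit_svd(X, y):
        # pinv via SVD: X = U S V^T  =>  pinv(X) = V S^+ U^T
        # (a real implementation would also threshold tiny singular values)
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return Vt.T @ ((U.T @ y) / s)

    def fit_chunked(chunks):
        # One pass over (X_block, y_block) pairs: accumulate X^T X and X^T y, solve once
        XtX, Xty = None, None
        for Xb, yb in chunks:
            if XtX is None:
                XtX, Xty = Xb.T @ Xb, Xb.T @ yb
            else:
                XtX += Xb.T @ Xb
                Xty += Xb.T @ yb
        return np.linalg.solve(XtX, Xty)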
Interestingly (another empirical result that's poorly understood), with stochastic gradient descent convergence often requires only one pass through the data, and if not, only a small number of passes.
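Something like this, as a hedged sketch of a single SGD pass on the same least-squares objective (the learning rate is an arbitrary illustrative choice, not a recommendation):

    import numpy as np

    def sgd_one_pass(X, y, lr=0.01):
        # One pass over the samples, minimizing 0.5 * (x . P - y)^2 per sample
        P = np.zeros(X.shape[1])
        for xi, yi in zip(X, y):
            grad = (xi @ P - yi) * xi  # gradient for a single sample
            P -= lr * grad
        return P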
And yes, this field only exists because we presume a really large amount of data, which often can't even fit on a single hard drive, and a really complex model.
Older kernel methods basically do what you want for tractable datasets. They can handle very high-order polynomials and also add the ability to regularize the solution in various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.
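For reference, kernel ridge regression with a polynomial kernel is about this much code; the degree and regularization strength below are illustrative, not tuned:

    import numpy as np

    def krr_poly_fit(X, y, degree=5, lam=1e-3):
        # X is (n_samples, n_features); polynomial kernel k(a, b) = (1 + a . b)^degree
        K = (1.0 + X @ X.T) ** degree
        alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
        return alpha

    def krr_poly_predict(X_train, alpha, X_new, degree=5):
        K_new = (1.0 + X_new @ X_train.T) ** degree
        return K_new @ alpha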