There is actually a fix to this problem in classical statistics (as far back as Pearson in the early 20th century) if the object is to perform PCA on data matrices: don't use the covariance matrix for PCA. The issue of units is only a problem if you're using a matrix that itself has units. It is readily solved by using the correlation matrix instead, which is dimensionless by definition. The downside to this workaround is that you have essentially re-weighted your variables so that each one contributes equal (unit) variance. This may not be what you want.
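For anyone who wants to see the difference concretely, here is a minimal numpy sketch (the toy data and names are mine, not Pearson's): PCA on the covariance matrix is dominated by whichever column happens to have the biggest numbers, while PCA on the correlation matrix is unit-free.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two correlated variables on wildly different scales,
    # e.g. height in meters and income in dollars.
    z = rng.standard_normal((500, 2))
    X = np.column_stack([z[:, 0], 50_000 * (0.6 * z[:, 0] + 0.8 * z[:, 1])])

    def leading_pc(M):
        # Eigenvector of M with the largest eigenvalue.
        vals, vecs = np.linalg.eigh(M)
        return vecs[:, np.argmax(vals)]

    print(leading_pc(np.cov(X, rowvar=False)))       # dominated by the large-scale column
    print(leading_pc(np.corrcoef(X, rowvar=False)))  # unit-free: comparable weights

Running PCA on the correlation matrix is the same as z-scoring each column first and then using the covariance matrix, which is exactly the re-weighting described above.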
If PCA is not being used to reduce the dimensionality of multivariate data, this fix might not apply. There are other uses of PCA besides working on data matrices (that image reduction technique using SVD comes to mind) that he might be addressing.
If you want a treatment of PCA in a respected text (at least by the statistical community--not sure what the ML people think ;) ), look no further than Hastie, Tibshirani and Friedman's Elements of Statistical Learning: http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLI...
> The downside to this circumvention is that you have essentially re-weighted each of your variables, so the weight contributed by each variable is more similar.
It is somewhat similar. The procedure I describe is normalizing the data (using z-scores instead of the raw values). The difference, as far as I can tell, is that normalizing retains the correlation structure of the data, while whitening decorrelates the variables.
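A small numpy sketch of that distinction (my own toy data): z-scoring rescales each column but leaves the correlation matrix untouched, whereas whitening also rotates, so the transformed variables come out uncorrelated with identity covariance.

    import numpy as np

    rng = np.random.default_rng(1)
    # Correlated 2-D data.
    X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

    # z-scoring: subtract the mean and divide by the standard deviation, per column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Whitening via the eigendecomposition of the covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    W = (X - X.mean(axis=0)) @ vecs / np.sqrt(vals)

    print(np.corrcoef(Z, rowvar=False))  # off-diagonal entries unchanged from X
    print(np.cov(W, rowvar=False))       # approximately the identity matrix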
R is the lingua franca of academic statisticians, so you might not derive a huge amount of value from this, but this question was asked a couple of years ago on Stats.SE: http://stats.stackexchange.com/questions/53/pca-on-correlati...
Isn't this ICA, independent component analysis? That makes much more sense when trying to extract information from a combination of multiple variables independently of their units and scale.
This kind of complaint applies to any method. Dimensional analysis just means that your data satisfies a certain kind of symmetry--a scaling symmetry in this case. It doesn't matter how long you define a meter to be, as long as you do it consistently; hence any answer should not depend on the length of a meter. There are other symmetries: for example, it does not matter where you put 0 on the temperature scale, hence your answer should not depend on where you put that zero (unless that zero has special significance in your context). This is a kind of additive symmetry. Another additive symmetry is that it shouldn't matter which year you call year 0 (unless it has special significance, e.g. you are investigating the birth of Jesus). In the same way your data can have any kind of symmetry; especially with multi-dimensional data you often get extra symmetries--for example, it shouldn't matter in which direction you define north and east.
For any given problem, you should generally only use methods that obey the symmetries that your data has. This doesn't mean that PCA is invalid as a method; it's just only valid on data where the scaling symmetry does not apply. Another example would be fitting a line through the origin for temperature data. That's invalid, because the result you get depends on where you define your zero (but as before, it might be valid if that zero has special significance in your context). Does that mean that fitting a line through the origin is invalid for any data set? No.
In other words, the same criticism could be applied to any given method. Just choose any symmetry that the method does not respect, and then declare it completely invalid. Hence we cannot dismiss a method outright purely based on this reasoning. For example, PCA is perfectly valid on unitless data. What's even stranger is that the author does like neural networks, which are certainly not dimensionally valid--heck, they probably don't satisfy any real-world symmetries. This is also a case where it can be OK to use a dimensionally inconsistent method. As long as it works, it works.
In addition to unitless data, PCA works when all of your variables have the same dimension. It simply doesn't make sense to build a principal component vector that is a mixture of (i.e. weighted sum of) vectors with non-identical units.
Maybe I'm having a brainfart or something... but since PCA is eigenvector/eigenspace-based and essentially finds directions that are linearly uncorrelated, changing the units of measurement shouldn't change which dimensions are most different about said vectors?
That's what I thought when I read this, but I haven't looked at PCA in a while, so I wasn't sure. It's only relative differences on each axis that matter, right?
PCA tries to project to the subspace that preserves as much distance in the input space as possible. If you multiply a coordinate in the input space by a factor of 2, it will contribute relatively more to the distances, and hence change the fitted projection beyond just a scaling factor.
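A quick numpy sanity check of this (toy data of my own choosing): doubling one coordinate does not just stretch the leading principal direction, it rotates it.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0], [[2.0, 1.0], [1.0, 2.0]], size=1000)

    def first_pc(data):
        vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
        return vecs[:, -1]  # eigenvector with the largest eigenvalue

    X_scaled = X.copy()
    X_scaled[:, 0] *= 2.0  # "change the units" of the first coordinate

    print(first_pc(X))         # roughly [0.71, 0.71], up to sign
    print(first_pc(X_scaled))  # a genuinely different direction, not just 2x the first one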
So we're talking about a very general problem. When I noticed that a system of equations including x=y can have a different least-squares solution when you change it to 2x=2y, I was quite surprised.
Thanks for the refresher, and a follow-up basic question: what methods are best at ranking the relative variance of each dimension/component regardless of unit scaling? SVD, or something else?
Lots of kinds of analyses require some kind of normalization or whitening to work properly.
The problem with PCA is that everyone knows about it and kind of understands it, and thinks that it'll magically tell them something interesting about their data. Especially when you can take the first three components and make cool-looking 3D plots...
> If you change the units that one of the variables is measured in, it will change all the "principal components"
The way that's worded, it sounds like a problem of sensitivity rather than of the method itself. If you change one of the variables to a unit that is way out of scale, then it's quite possible that the results of PCA, and many other methods, will change. But that's because it's not scale invariant, so if you want good results you need to present your variables in the same units, and/or in some normalized format (z-scored, etc.) where the scale of one unit doesn't blow the others out of the water.
These methods are not magical, and they are not intelligent. They do not know what they're looking at, so it's your job to feed them something reasonable.
That said, if you take some physically grounded data and change units, you won't recover different eigenvectors as long as your input data is good. The physics doesn't care about the units, or the coordinate system, or whether you use python or anything else.
> if you take some physically grounded data and change units, you won't recover different eigenvectors as long as your input data is good.
I don't think that's accurate. Consider a set of points distributed along a line in 2D. If you do PCA on these points you will find that the 1st eigenvector points along the line and the second is orthogonal to it.
If you now rescale the axes so that x is measured in meters and y in light years, the slope of the line will change and so will the 2 PCA eigenvectors.
However the relationship between the 2 eigenvectors and the distribution of the data points will remain the same. The first eigenvector will still point along the line and the second will still be orthogonal to it.
In machine learning one is interested in the distribution of the data, not in whatever units it happens to be measured in, hence I don't understand MacKay's objection.
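For what it's worth, here is that 2-D line example as a numpy sketch (noiseless points on a line, numbers of my own choosing): the eigenvector coordinates change after rescaling y, but the first one still follows the line in whichever units you picked.

    import numpy as np

    t = np.linspace(-1, 1, 100)
    X = np.column_stack([t, 0.5 * t])  # points exactly on a line through the origin

    def pcs(data):
        vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
        return vecs[:, ::-1]           # columns ordered by decreasing eigenvalue

    Y = X.copy()
    Y[:, 1] *= 1e3                     # measure y in much smaller units

    for data in (X, Y):
        v1, v2 = pcs(data).T
        slope = data[:, 1].max() / data[:, 0].max()      # slope of the line in this scaling
        print(v1, v2, np.isclose(v1[1] / v1[0], slope))  # first PC still points along the line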
The mathematical way to state this is that a choice of units is equivalent to a choice of an inner product in linear algebra (or, more generally, the choice of a metric tensor in differential geometry). Basically, the choice of an inner product defines what it means for something to be orthogonal in a space.
Principal component analysis is predicated on a choice of inner product, since component directions are always chosen to be orthogonal. It's not clear what orthogonality could mean in a 2D plane where one direction is measured in inches and the other in tons, so naive PCA isn't appropriate in such a case.
Others have mentioned in this thread that "whitening" the data before PCA fixes this problem, by removing cross-correlations. Presumably, in that case, the notion of orthogonality is taken from the statistical properties of the data. (Maybe it normalizes physical units like inches to the standard deviation of the data's distribution in inches?)
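To make the units-as-inner-product point above precise (my notation, not the parent's): if the stored numbers are x_i = q_i / u_i and y_i = r_i / u_i, where q_i, r_i are the physical quantities and u_i is the chosen unit for coordinate i, then the ordinary dot product on the stored numbers is a unit-dependent inner product on the physical quantities:

    \langle x, y \rangle \;=\; \sum_i x_i y_i \;=\; \sum_i \frac{q_i r_i}{u_i^2} \;=\; q^\top M r,
    \qquad M = \operatorname{diag}(u_1^{-2}, \dots, u_d^{-2})

Change the units u_i and you change M, hence which directions count as orthogonal, hence the principal components.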
In practice, if you're working in high dimensions and don't do PCA then you have very little chance of building a good model. It certainly isn't valid if, to you, validity demands scale invariance, but it's an essential tool for unsupervised feature engineering.
Auto-encoders have some big advantages over PCA, but they suffer the same shortcoming (sensitivity to units of measurement) described in the original post.
Another common method is to scale to mean 0, variance 1. In my opinion this makes more sense since it handles outliers a bit better--e.g., consider a case where most of your values for a feature range from 1 to 10 but there's one point with value 1,000,000.
I agree that you SHOULD normalize/scale data before running neural networks and autoencoders... and this resolves the units issue in most cases (unless measurements in some units are non-linear functions of measurements in others).
But this scaling also resolves the issue for PCA. So, I don't see much difference between autoencoders and PCA with regards to original post's "dimensional invalidity" concern.
If anything, the scaling options you mention suggest "dimensional invalidity" isn't a big deal in practice for either method.
To expand on this a bit (please correct me if I'm wrong), there are three major uses of dimensionality reduction:
1. Reduce overfitting
2. Train models faster
3. Take advantage of unsupervised data
Regularization handles case (1) quite well. It can also be used in conjunction with most methods of dimensionality reduction such as PCA/auto-encoders.
PCA covers all three, but isn't as effective at dimensionality reduction as auto-encoders.
Auto-encoders tend to yield a better compression than PCA but take more time to train and produce output that's harder to understand. There is a bit of an analogy here: auto-encoders are to PCA what neural nets are to linear regression.
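To make the analogy concrete, here is a minimal sketch (my own toy code, assuming PyTorch and scikit-learn are available) that compresses the same data to k dimensions with PCA and with a small nonlinear auto-encoder, then compares reconstruction error:

    import numpy as np
    import torch
    from torch import nn
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    torch.manual_seed(0)
    X = rng.standard_normal((1000, 20)).astype(np.float32)  # stand-in for real features
    k = 3  # size of the compressed representation

    # PCA: the best *linear* rank-k reconstruction in the least-squares sense.
    pca = PCA(n_components=k).fit(X)
    pca_mse = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2)

    # Auto-encoder: a nonlinear encoder/decoder trained on the same objective.
    model = nn.Sequential(
        nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, k),  # encoder
        nn.Linear(k, 16), nn.ReLU(), nn.Linear(16, 20),  # decoder
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    Xt = torch.from_numpy(X)
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Xt), Xt)
        loss.backward()
        opt.step()

    print("PCA reconstruction MSE:         ", pca_mse)
    print("auto-encoder reconstruction MSE:", float(loss))

On unstructured noise like this neither method has much to find and the auto-encoder has no real edge (PCA is already the optimal linear compressor); its advantage shows up on data with nonlinear structure, at the cost of training time and interpretability.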
L1 regularization can also be used for feature selection[0] and hence dimensionality reduction. I've found in practice that the speed of this can compare favorably even to fast implementations[1] of SVD (see the sketch below, after the references).
[0] Andrew Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in Proceedings of the Twenty-first International Conference on Machine Learning, 2004. http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf
[1] J. Baglama and L. Reichel, "Augmented Implicitly Restarted Lanczos Bidiagonalization Methods," SIAM J. Sci. Comput., 2005.
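Here is a minimal scikit-learn sketch of that route (hypothetical data; the alpha value is an arbitrary choice you would normally cross-validate): fit a Lasso and keep the features with nonzero coefficients, so you retain original, interpretable columns rather than the rotated combinations PCA/SVD would give you.

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))
    y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500)  # only 5 informative columns

    # Scale first so the L1 penalty treats all features on an equal footing.
    Xs = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=0.1).fit(Xs, y)

    selected = np.flatnonzero(lasso.coef_)  # features that survived the penalty
    print(len(selected), "features kept out of", X.shape[1])
    X_reduced = X[:, selected]              # use these columns in downstream models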
While I agree with you, if the author believed the entire method is invalid in all situations, he could make a strong argument that it wasn't worth the space to put it in at all. It would be a long book if it was filled with all the methods currently in production that did _not_ work!
"X is a ... method that gives people a delusion that they are doing something useful with their data." could be applied to pretty much any method in ML if you don't know what you're doing. PCA is a strange omission, but it's not like it's hard to find references on it. ITILA is a great book and it's legally free online, by the way.
If anyone is interested, PCA is used in scale development to help reduce the number of items down to something useful. It is typically followed by a factor analysis to determine factors, though far too many use PCA in place of a factor analysis. I didn't realize it was used much outside of measurement.
PCA and other similar forms of dimensionality reduction are used all over the place, from healthcare to control systems to financial analysis. These days it is most often used as a preprocessing step for some other analysis routine, though.
In data compression, there is what is called the Karhunen-Loeve or Hotelling transform. It's essentially an alias for "Principal Component Analysis".
The Hotelling transform is interesting in that it achieves optimal energy compaction, but has little practical value since it needs to be constructed anew for each dataset (usually, images).
Just to pull your thoughts away from massive, unfiltered big data :)
So what alternative method of dimensionality reduction does the author recommend?
Considering that PCA works fine in the case that the dimensions form a proper vector space (e.g. geometrical space, stock market returns, temperature, etc.), it seems questionable to completely dismiss such a useful and historically important method.
Yes, of course you will--because you've normalized the data. If you had run PCA on just the ingredients instead of on the normalized ingredients, I would imagine the results would be different. I can't verify since I don't have Matlab on this computer, but doing PCA on raw data with one set of units will produce a different result than doing it on data with another set of units.
Yes, you get different results if you don't normalize. My point was, I don't see why "you have to trivially normalize your data first" is a meaningful argument against anything.
Because normalization reduces the influence of variables that have a higher variance. With raw data, if you have marathon times and heights for runners in a race, and you measure the times in minutes and the heights in inches, the influence of the times on the principal components will likely be much greater.
Like all problems in statistics, it ought to depend on the specific task at hand. If there is some a priori reason to use the original scale (or a different re-weighting), it ought to be used. In general, PCA on correlation matrices is much preferred for exactly the reason you mention.
I'm still waiting for Information Theory to be updated with the errata I submitted. In at least two places (pages 458 and 459) he mentions the possibility of accepting (!?) the null hypothesis.
That's an extremely dogmatic position. There are certainly situations where it is not useful or even accurate to proceed under the null hypothesis, but the converse is also very much true. Checking Wikipedia, I see they also describe the scenario as "accept or fail to reject".
If you want to argue for why "accept" is materially different from "fail to reject", feel free to do so - but I suggest that the chasm is by no means wide.
You can always make the variables in any system non-dimensional. The whole field of chemical engineering works on this principle. Once you do that, you are free of the irregularities/issues that stem from scaling.
Whitening and (full rank) PCA use the same linear transformation, except that whitening additionally rescales each axis by the inverse square root of its eigenvalue so that all axes in the transformed system have unit variance.
In other words whitening the data before applying PCA should result in the same eigenvectors expressed in the original coordinate system.
Yes, but I think for many places where PCA is used, we are precisely interested in which eigenvectors have the largest eigenvalues. The scaling beforehand makes a unit change less likely to affect which eigenvectors are the most important.
But once you know the eigenvectors the eigenvalues (and hence their distribution) are determined, again in the original space.
On the other hand if you're talking about the eigenvalues for the whitened data they're all 1.
So I'm still not seeing what whitening adds to PCA.
Just rescaling each dimension of the original space so that all dimensions have unit variance, without doing any rotations, may change things, but I don't think that's what is usually called whitening (according to Wikipedia).