There is actually a fix to this problem in classical statistics (as far back as Pearson in the early 20th century) if the object is to perform PCA on data matrices: don't use the covariance matrix for PCA. The issue of units is only a problem if you're using a matrix that itself has units. It is readily solved by using the correlation matrix instead, which is dimensionless by definition. The downside to this workaround is that you have essentially re-weighted your variables so that each one contributes equal (unit) variance. This may not be what you want.
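For anyone who wants to see the difference concretely, here is a minimal numpy sketch (the toy data and names are mine, not Pearson's): PCA on the covariance matrix is dominated by whichever column happens to have the biggest numbers, while PCA on the correlation matrix is unit-free.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two correlated variables on wildly different scales,
    # e.g. height in meters and income in dollars.
    z = rng.standard_normal((500, 2))
    X = np.column_stack([z[:, 0], 50_000 * (0.6 * z[:, 0] + 0.8 * z[:, 1])])

    def leading_pc(M):
        # Eigenvector of M with the largest eigenvalue.
        vals, vecs = np.linalg.eigh(M)
        return vecs[:, np.argmax(vals)]

    print(leading_pc(np.cov(X, rowvar=False)))       # dominated by the large-scale column
    print(leading_pc(np.corrcoef(X, rowvar=False)))  # unit-free: comparable weights

Running PCA on the correlation matrix is the same as z-scoring each column first and then using the covariance matrix, which is exactly the re-weighting described above.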
If PCA is not being used to reduce the dimensionality of multivariate data, this fix might not apply. There are other uses of PCA besides working on data matrices (that image reduction technique using SVD comes to mind) that he might be addressing.
If you want a treatment of PCA in a respected text (at least by the statistical community--not sure what the ML people think ;) ), look no further than Hastie, Tibshirani and Friedman's Elements of Statistical Learning: http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLI...
> The downside to this circumvention is that you have essentially re-weighted each of your variables, so the weight contributed by each variable is more similar.
It is somewhat similar. The procedure I describe is normalizing the data (using z-scores instead of the raw values). The difference, as far as I can tell, is that normalizing retains the correlation structure of the data, while whitening decorrelates the variables.
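A small numpy sketch of that distinction (my own toy data): z-scoring rescales each column but leaves the correlation matrix untouched, whereas whitening also rotates, so the transformed variables come out uncorrelated with identity covariance.

    import numpy as np

    rng = np.random.default_rng(1)
    # Correlated 2-D data.
    X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

    # z-scoring: subtract the mean and divide by the standard deviation, per column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Whitening via the eigendecomposition of the covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    W = (X - X.mean(axis=0)) @ vecs / np.sqrt(vals)

    print(np.corrcoef(Z, rowvar=False))  # off-diagonal entries unchanged from X
    print(np.cov(W, rowvar=False))       # approximately the identity matrix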
R is the lingua franca of academic statisticians, so you might not derive a huge amount of value from this, but this question was asked a couple of years ago on Stats.SE: http://stats.stackexchange.com/questions/53/pca-on-correlati...
Isn't this ICA, independent component analysis? That makes much more sense when trying to extract information from a combination of multiple variables independently of their units and scale.
This kind of complaint applies to any method. Dimensional analysis just means that your data satisfies a certain kind of symmetry--a scaling symmetry in this case. It doesn't matter how long you define a meter to be, as long as you do it consistently; hence any answer should not depend on the length of a meter. There are other symmetries: for example, it does not matter where you put 0 on the temperature scale, hence your answer should not depend on where you put that zero (unless that zero has special significance in your context). This is a kind of additive symmetry. Another additive symmetry is that it shouldn't matter which year you call year 0 (unless it has special significance, e.g. you are investigating the birth of Jesus). In the same way your data can have any kind of symmetry; especially with multi-dimensional data you often get extra symmetries--for example, it shouldn't matter in which direction you define north and east.
For any given problem, you should generally only use methods that obey the symmetries that your data has. This doesn't mean that PCA is invalid as a method; it's just only valid on data where the scaling symmetry does not apply. Another example would be fitting a line through the origin for temperature data. That's invalid, because the result you get depends on where you define your zero (but as before, it might be valid if that zero has special significance in your context). Does that mean that fitting a line through the origin is invalid for any data set? No.
In other words, the same criticism could be applied to any given method. Just choose any symmetry that the method does not respect, and then declare it completely invalid. Hence we cannot dismiss a method outright purely based on this reasoning. For example, PCA is perfectly valid on unitless data. What's even stranger is that the author does like neural networks, which are certainly not dimensionally valid--heck, they probably don't satisfy any real-world symmetries. This is also a case where it can be OK to use a dimensionally inconsistent method. As long as it works, it works.
In addition to unitless data, PCA works when all of your variables have the same dimension. It simply doesn't make sense to build a principal component vector that is a mixture of (i.e. weighted sum of) vectors with non-identical units.
Maybe I'm having a brainfart or something... but since PCA is eigenvector/eigenspace-based and essentially finds directions that are linearly uncorrelated, changing the units of measurement shouldn't change which dimensions are most different about said vectors?
That's what I thought when I read this, but I haven't looked at PCA in a while, so I wasn't sure. It's only relative differences on each axis that matter, right?
PCA tries to project to the subspace that preserves as much distance in the input space as possible. If you multiply a coordinate in the input space by a factor of 2, it will contribute relatively more to the distances, and hence change the fitted projection beyond just a scaling factor.
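A quick numpy sanity check of this (toy data of my own choosing): doubling one coordinate does not just stretch the leading principal direction, it rotates it.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0], [[2.0, 1.0], [1.0, 2.0]], size=1000)

    def first_pc(data):
        vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
        return vecs[:, -1]  # eigenvector with the largest eigenvalue

    X_scaled = X.copy()
    X_scaled[:, 0] *= 2.0  # "change the units" of the first coordinate

    print(first_pc(X))         # roughly [0.71, 0.71], up to sign
    print(first_pc(X_scaled))  # a genuinely different direction, not just 2x the first one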
So we're talking about a very general problem. When I noticed that a system of equations including x=y can have a different least-squares solution when you change it to 2x=2y, I was quite surprised.
Thanks for the refresher, and a follow-up basic question: what methods are best at ranking the relative variance of each dimension/component regardless of unit scaling? SVD, or something else?
Lots of kinds of analyses require some kind of normalization or whitening to work properly.
The problem with PCA is that everyone knows about it and kind of understands it, and thinks that it'll magically tell them something interesting about their data. Especially when you can take the first three components and make cool-looking 3D plots...
> If you change the units that one of the variables is measured in, it will change all the "principal components"
The way that's worded, it sounds like a problem of sensitivity rather than of the method itself. If you change one of the variables to a unit that is way out of scale, then it's quite possible that the results of PCA, and many other methods, will change. But that's because it's not scale invariant, so if you want good results you need to present your variables in the same units, and/or in some normalized format (z-scored, etc.) where the scale of one unit doesn't blow the others out of the water.
These methods are not magical, and they are not intelligent. They do not know what they're looking at, so it's your job to feed them something reasonable.
That said, if you take some physically grounded data and change units, you won't recover different eigenvectors as long as your input data is good. The physics doesn't care about the units, or the coordinate system, or whether you use python or anything else.
> if you take some physically grounded data and change units, you won't recover different eigenvectors as long as your input data is good.
I don't think that's accurate. Consider a set of points distributed along a line in 2D. If you do PCA on these points you will find that the 1st eigenvector points along the line and the second is orthogonal to it.
If you now rescale the axes so that x is measured in meters and y in light years, the slope of the line will change and so will the 2 PCA eigenvectors.
However the relationship between the 2 eigenvectors and the distribution of the data points will remain the same. The first eigenvector will still point along the line and the second will still be orthogonal to it.
In machine learning one is interested in the distribution of the data, not in whatever units it happens to be measured in, hence I don't understand MacKay's objection.
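For what it's worth, here is that 2-D line example as a numpy sketch (noiseless points on a line, numbers of my own choosing): the eigenvector coordinates change after rescaling y, but the first one still follows the line in whichever units you picked.

    import numpy as np

    t = np.linspace(-1, 1, 100)
    X = np.column_stack([t, 0.5 * t])  # points exactly on a line through the origin

    def pcs(data):
        vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
        return vecs[:, ::-1]           # columns ordered by decreasing eigenvalue

    Y = X.copy()
    Y[:, 1] *= 1e3                     # measure y in much smaller units

    for data in (X, Y):
        v1, v2 = pcs(data).T
        slope = data[:, 1].max() / data[:, 0].max()      # slope of the line in this scaling
        print(v1, v2, np.isclose(v1[1] / v1[0], slope))  # first PC still points along the line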
The mathematical way to state this is that a choice of units is equivalent to a choice of an inner product in linear algebra (or, more generally, the choice of a metric tensor in differential geometry). Basically, the choice of an inner product defines what it means for something to be orthogonal in a space.
Principal component analysis is predicated on a choice of inner product, since component directions are always chosen to be orthogonal. It's not clear what orthogonality could mean in a 2D plane where one direction is measured in inches and the other in tons, so naive PCA isn't appropriate in such a case.
Others have mentioned in this thread that "whitening" the data before PCA fixes this problem, by removing cross-correlations. Presumably, in that case, the notion of orthogonality is taken from the statistical properties of the data. (Maybe it normalizes physical units like inches to the standard deviation of the data's distribution in inches?)
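To make the units-as-inner-product point above precise (my notation, not the parent's): if the stored numbers are x_i = q_i / u_i and y_i = r_i / u_i, where q_i, r_i are the physical quantities and u_i is the chosen unit for coordinate i, then the ordinary dot product on the stored numbers is a unit-dependent inner product on the physical quantities:

    \langle x, y \rangle \;=\; \sum_i x_i y_i \;=\; \sum_i \frac{q_i r_i}{u_i^2} \;=\; q^\top M r,
    \qquad M = \operatorname{diag}(u_1^{-2}, \dots, u_d^{-2})

Change the units u_i and you change M, hence which directions count as orthogonal, hence the principal components.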
In practice, if you're working in high dimensions and don't do PCA then you have very little chance of building a good model. It certainly isn't valid if, to you, validity demands scale invariance, but it's an essential tool for unsupervised feature engineering.
Auto-encoders have some big advantages over PCA, but they suffer the same shortcoming (sensitivity to units of measurement) described in the original post.
Another common method is to scale to mean 0, variance 1. In my opinion this makes more sense since it handles outliers a bit better--e.g., consider a case where most of your values for a feature range from 1 to 10 but there's one point with value 1,000,000.
I agree that you SHOULD normalize/scale data before running neural networks and autoencoders... and this resolves the units issue in most cases (unless measurements in some units are non-linear functions of measurements in others).
But this scaling also resolves the issue for PCA. So, I don't see much difference between autoencoders and PCA with regards to original post's "dimensional invalidity" concern.
If anything, the scaling options you mention suggest "dimensional invalidity" isn't a big deal in practice for either method.
To expand on this a bit (please correct me if I'm wrong), there are three major uses of dimensionality reduction:
1. Reduce overfitting
2. Train models faster
3. Take advantage of unsupervised data
Regularization handles case (1) quite well. It can also be used in conjunction with most methods of dimensionality reduction such as PCA/auto-encoders.
PCA covers all three, but isn't as effective at dimensionality reduction as auto-encoders.
Auto-encoders tend to yield a better compression than PCA but take more time to train and produce output that's harder to understand. There is a bit of an analogy here: auto-encoders are to PCA what neural nets are to linear regression.
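To make the analogy concrete, here is a minimal sketch (my own toy code, assuming PyTorch and scikit-learn are available) that compresses the same data to k dimensions with PCA and with a small nonlinear auto-encoder, then compares reconstruction error:

    import numpy as np
    import torch
    from torch import nn
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    torch.manual_seed(0)
    X = rng.standard_normal((1000, 20)).astype(np.float32)  # stand-in for real features
    k = 3  # size of the compressed representation

    # PCA: the best *linear* rank-k reconstruction in the least-squares sense.
    pca = PCA(n_components=k).fit(X)
    pca_mse = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2)

    # Auto-encoder: a nonlinear encoder/decoder trained on the same objective.
    model = nn.Sequential(
        nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, k),  # encoder
        nn.Linear(k, 16), nn.ReLU(), nn.Linear(16, 20),  # decoder
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    Xt = torch.from_numpy(X)
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Xt), Xt)
        loss.backward()
        opt.step()

    print("PCA reconstruction MSE:         ", pca_mse)
    print("auto-encoder reconstruction MSE:", float(loss))

On unstructured noise like this neither method has much to find and the auto-encoder has no real edge (PCA is already the optimal linear compressor); its advantage shows up on data with nonlinear structure, at the cost of training time and interpretability.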
L1 regularization can also be used for feature selection[0] and hence dimensionality reduction. I've found in practice that the speed of this can compare favorably even to fast implementations[1] of SVD (see the sketch below, after the references).
[0] Andrew Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in Proceedings of the Twenty-first International Conference on Machine Learning, 2004. http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf
[1] J. Baglama and L. Reichel, "Augmented Implicitly Restarted Lanczos Bidiagonalization Methods," SIAM J. Sci. Comput., 2005.
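Here is a minimal scikit-learn sketch of that route (hypothetical data; the alpha value is an arbitrary choice you would normally cross-validate): fit a Lasso and keep the features with nonzero coefficients, so you retain original, interpretable columns rather than the rotated combinations PCA/SVD would give you.

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))
    y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500)  # only 5 informative columns

    # Scale first so the L1 penalty treats all features on an equal footing.
    Xs = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=0.1).fit(Xs, y)

    selected = np.flatnonzero(lasso.coef_)  # features that survived the penalty
    print(len(selected), "features kept out of", X.shape[1])
    X_reduced = X[:, selected]              # use these columns in downstream models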
While I agree with you, if the author believed the entire method is invalid in all situations, he could make a strong argument that it wasn't worth the space to put it in at all. It would be a long book if it was filled with all the methods currently in production that did _not_ work!
"X is a ... method that gives people a delusion that they are doing something useful with their data." could be applied to pretty much any method in ML if you don't know what you're doing. PCA is a strange omission, but it's not like it's hard to find references on it. ITILA is a great book and it's legally free online, by the way.
If anyone is interested, PCA is used in scale development to help reduce the number of items down to something useful. It is typically followed by a factor analysis to determine factors, though far too many use PCA in place of a factor analysis. I didn't realize it was used much outside of measurement.
PCA and other similar forms of dimensionality reduction are used all over the place, from healthcare to control systems to financial analysis. These days it is most often used as a preprocessing step for some other analysis routine, though.
In data compression, there is what is called the Karhunen-Loeve or Hotelling transform. It's essentially an alias for "Principal Component Analysis".
The Hotelling transform is interesting in that it achieves optimal energy compaction, but has little practical value since it needs to be constructed anew for each dataset (usually, images).
Just to pull your thoughts away from massive, unfiltered big data :)
So what alternative method of dimensionality reduction does the author recommend?
Considering that PCA works fine in the case that the dimensions form a proper vector space (e.g. geometrical space, stock market returns, temperature, etc.), it seems questionable to completely dismiss such a useful and historically important method.
Yes, of course you will--because you've normalized the data. If you had run PCA on just the ingredients instead of on the normalized ingredients, I would imagine the results would be different. I can't verify since I don't have Matlab on this computer, but doing PCA on raw data with one set of units will produce a different result than doing it on data with another set of units.
Yes, you get different results if you don't normalize. My point was, I don't see why "you have to trivially normalize your data first" is a meaningful argument against anything.
Because normalization reduces the influence of variables that have a higher variance. With raw data, if you have marathon times and heights for runners in a race, and you measure the times in minutes and the heights in inches, the influence of the times on the principal components will likely be much greater.
Like all problems in statistics, it ought to depend on the specific task at hand. If there is some a priori reason to use the original scale (or a different re-weighting), it ought to be used. In general, PCA on correlation matrices is much preferred for exactly the reason you mention.
I'm still waiting for Information Theory to be updated with the errata I submitted. In at least two places (pages 458 and 459) he mentions the possibility of accepting (!?) the null hypothesis.
That's an extremely dogmatic position. There are certainly situations where it is not useful or even accurate to proceed under the null hypothesis, but the converse is also very much true. Checking Wikipedia, I see they also describe the scenario as "accept or fail to reject".
If you want to argue for why "accept" is materially different from "fail to reject", feel free to do so - but I suggest that the chasm is by no means wide.
You can always make the variables in any system non-dimensional. The whole field of chemical engineering works on this principle. Once you do that, you are free of the irregularities/issues that stem from scaling.
Whitening and (full rank) PCA use the same linear transformation, except that whitening additionally rescales each axis by the inverse square root of its eigenvalue so that all axes in the transformed system have unit variance.
In other words whitening the data before applying PCA should result in the same eigenvectors expressed in the original coordinate system.
Yes, but I think for many places where PCA is used, we are precisely interested in which eigenvectors have the largest eigenvalues. The scaling beforehand makes a unit change less likely to affect which eigenvectors are the most important.
But once you know the eigenvectors the eigenvalues (and hence their distribution) are determined, again in the original space.
On the other hand if you're talking about the eigenvalues for the whitened data they're all 1.
So I'm still not seeing what whitening adds to PCA.
Just rescaling each dimension of the original space so that all dimensions have unit variance, without doing any rotations, may change things, but I don't think that's what is usually called whitening (according to Wikipedia).