I am in a discussion on whether you can save disk space by doing PCA on your data. Suppose you have the covariance matrix and your data vectors are of length 1000. The proposed method to cut space by 50% would be:
Me: This doesn't save any space for the vectors, because after the rotation all 1000 components will still be nonzero. There is no compression. The data are probably simplified, but that is a different thing. Him: Just take the first 500 elements of the result; that is your "compression".
I know I am right, but plenty of people in the literature say they are doing compression with PCA. Here is an example:
http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/
I think this tutorial is mostly right and is a nice description, but the conclusion on compression is wrong. Still, how could something so obvious be overlooked by people who clearly work with data? It makes me think that I am wrong.
Can anyone help me understand their viewpoint?
In my opinion:
1- Yes, you can compress data by PCA, because each vector you have to store has a lower dimension than the original. Of course, you also have to store the projection matrix to decompress the data, but if your original dataset is large enough, this overhead is insignificant compared to the data itself (see the sketch after this list).
2- Of course there is a drawback: the compression is not lossless. You lose the original data forever, and the decompressed version won't be exactly the same as the original; it will be an approximation.
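Here is a minimal NumPy sketch of both points. All the sizes (number of vectors, number of hidden factors, noise level) are made-up values just for illustration:

```python
import numpy as np

# Toy data: 2,000 vectors of length 1000 with strong collinearity, so
# most of the variance lies in a low-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2_000, 50))        # 50 hidden factors
mixing = rng.normal(size=(50, 1000))
X = latent @ mixing + 0.01 * rng.normal(size=(2_000, 1000))

# PCA via SVD of the centered data matrix.
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 500                                      # keep the first 500 components
Z = Xc @ Vt[:k].T                            # compressed vectors, shape (2000, 500)

# What you store: Z (n x k), plus the basis Vt[:k] (k x 1000) and the
# mean vector (1000,). For n >> k the basis and mean are a negligible
# overhead next to the 50% saving on the vectors themselves.

# Lossy decompression: an approximation of X, never X itself.
X_hat = Z @ Vt[:k] + mean
print("relative reconstruction error:",
      np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```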
At this point here's my advice:
If you have a lot of data with the same form (vectors of the same dimension), your interest in the data is qualitative (you don't care about the exact numbers, only approximate values), and some of the data shows collinearity (linear dependence between components), then PCA is a way to save storage space.
It is important to check how much of the original variance you retain, because losing too much variance is the signal that you have chosen too aggressive a compression.
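Continuing the sketch above, the singular values give you this check directly (the 0.99 threshold is only an illustrative choice, not a universal rule):

```python
# Variance along each principal component is proportional to s**2,
# where s are the singular values of the centered data Xc.
var_retained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"variance retained with k={k}: {var_retained:.4f}")
# If this drops well below your tolerance (say 0.99), k is too small
# and the compression is too aggressive for your data.
```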
Anyway, the main purpose of PCA is not saving storage space... it is to make heavy operations on the data faster while obtaining a very similar result.
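As one illustration of that (again continuing the sketch above), pairwise-distance computations scale linearly with the vector length, so halving the dimension roughly halves the work, and because PCA is an orthogonal projection onto the high-variance directions, the answers barely change:

```python
from scipy.spatial.distance import cdist

# O(n^2 * d) cost: d=500 is ~2x cheaper than d=1000. Distances are
# translation-invariant, so using the centered Xc changes nothing.
d_full = cdist(Xc[:200], Xc[:200])   # distances in the original 1000-D space
d_comp = cdist(Z[:200], Z[:200])     # distances in the 500-D PCA space
print("max distance discrepancy:", np.abs(d_full - d_comp).max())
```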
I hope this is helpful for you.