I'm learning machine learning by looking through other people's kernels on Kaggle, specifically this Mushroom Classification kernel. The author first applied PCA to the one-hot-encoded indicator matrix and kept only 2 principal components for visualization later. Then I checked how much variance those components retain, and found that it is only about 16%.
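For reference, here is roughly what I believe the kernel does (a sketch; the file name and variable names are my own, not necessarily the kernel's):

import pandas as pd
from sklearn.decomposition import PCA

# Load the mushroom data (every column is categorical).
df = pd.read_csv("mushrooms.csv")

# One-hot encode the features into an indicator matrix.
X = pd.get_dummies(df.drop(columns=["class"]))
y = (df["class"] == "p").astype(int)  # 1 = poisonous, 0 = edible

# Keep only the first 2 principal components, as in the kernel.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)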
In [18]: pca.explained_variance_ratio_.cumsum()
Out[18]: array([0.09412961, 0.16600686])
But the test result of 90% accuracy suggests it works well. My question is: if variance stands for information, how can the ML model work well when so much information has been lost?
You should note that many of the variables in the original feature space are sparsely coded (one-hot) categorical variables. PCA is not well suited to such variables, and the way it is applied in the code you're referring to is not recommended.
Now your obvious question is: why does it work in the first place? And why with only two variables? Ask yourself this: would you be able to tell whether a mushroom is poisonous if I told you its colour is red and whether it has gills (lamellae)? If you know anything about mushrooms, then yes, in the vast majority of cases you could. That's what the algorithm here is doing. Not much variance is explained because there are many variables, and some of the most meaningful ones, like colour, are sparsely coded, so from PCA's point of view their variance is effectively distributed over many indicator columns, as the sketch below illustrates.
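To see how one-hot encoding dilutes the variance of a single feature, here is a toy sketch with made-up data: a single categorical "colour" with 10 equally likely levels becomes 10 indicator columns, and PCA spreads its variance nearly uniformly over the resulting components.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One categorical feature with 10 equally likely levels, one-hot encoded.
colour = pd.Series(rng.integers(0, 10, size=1000), name="colour")
X = pd.get_dummies(colour.astype(str))

# No single component captures much of this one feature's variance;
# it is split nearly evenly across the components.
pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))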
Also, I would not say it works well, and the visualisation shows exactly this. Consider this image, which shows the logistic regression test-set results:
According to the test results, it has 90% accuracy. But when you look at the plot, do you think it did well? In the bottom-left corner there's a mix of edible and poisonous mushrooms; apparently that's where our two computed features are not enough. The ruby bolete, for example, is red and edible.
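For completeness, a minimal sketch of how the quoted accuracy could be reproduced (my own split and seed, not necessarily the kernel's), continuing from the snippet in the question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Continuing from the question's snippet: X_pca holds the two
# principal components and y the poisonous/edible labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # the kernel reports roughly 90% accuracy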