Tags: python, machine-learning, scikit-learn, data-science, dimensionality-reduction

How to evaluate the information retained in UMAP?


I tried to find an attribute similar to explained_variance_ratio_ (in PCA in sklearn) for UMAP but am unable to find such a thing. In PCA, I could use explained_variance_ratio_ for different values of n_components and compare the results. Is there any such thing that I can use for UMAP in Python?


Solution

  • You cannot easily estimate the variance explained by UMAP because, unlike PCA, it is a form of nonlinear dimension reduction. Below is a more detailed dive.

    PCA tries to find projections in the high-dimensional space that capture as much variance as possible. You project the data onto these orthogonal axes, and you can estimate the variance captured by each, relative to the variance in the original data. It is a linear operation throughout, so the explained variance is well defined. You can check out this post about variance explained or this one about PCA.
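
    To illustrate what the question is referring to, here is a minimal sketch of reading explained_variance_ratio_ from a fitted sklearn PCA (using the Iris dataset purely as example data):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data

    # Fit PCA with 2 components; explained_variance_ratio_ gives the
    # fraction of total variance captured by each principal component.
    pca = PCA(n_components=2).fit(X)

    print(pca.explained_variance_ratio_)        # per-component fractions
    print(pca.explained_variance_ratio_.sum())  # total fraction retained
    ```

    Because PCA is linear and its components are orthogonal, these per-component fractions sum to the total variance retained, which is exactly the quantity that has no direct analogue in UMAP.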

    UMAP is a form of nonlinear dimension reduction. From the help page, UMAP uses so-called simplicial complexes to capture the topological structure of your features, and from there obtains a low-dimensional reduction. You can think of it as a high-dimensional graph that is more geared towards capturing the inter-connectedness between data points than the variance. Hence, as of now, I am not aware of a way to retrieve the variance explained in a UMAP. You can also check out the author's reply on GitHub.
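
    Since UMAP aims to preserve neighborhood structure rather than variance, one common proxy is to score how well local neighborhoods survive the embedding. A minimal sketch using sklearn.manifold.trustworthiness (here a PCA embedding stands in for the UMAP output, so the snippet runs without umap-learn installed; in practice you would pass umap.UMAP().fit_transform(X) as the embedding):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.manifold import trustworthiness

    X = load_iris().data

    # Stand-in embedding; replace with a UMAP embedding in practice.
    embedding = PCA(n_components=2).fit_transform(X)

    # trustworthiness is in [0, 1]: values near 1 mean the points that are
    # neighbors in the embedding were also neighbors in the original space.
    score = trustworthiness(X, embedding, n_neighbors=10)
    print(score)
    ```

    This does not recover a variance-explained figure, but it gives a comparable scalar you can track across different n_components or n_neighbors settings, in the spirit of what the question asks for.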