python pandas scikit-learn sklearn-pandas

How can I get the feature names from sklearn TruncatedSVD object?

I have the following code:

import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
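dates = pd.date_range('2020-01-01', periods=1000)  # placeholder index; `dates` was not defined in the original snippet
df = pd.DataFrame(np.random.randn(1000, 25), index=dates, columns=list('ABCDEFGHIJKLMOPQRSTUVWXYZ'))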

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)

fitted = reduce(5)

How do I get the column names from fitted?

Solution

The columns of the fitted output are SVD dimensions, not the original features.

Each dimension is a linear combination of the input features. To understand what a particular dimension means, look at the svd.components_ array: it holds the matrix of coefficients that the input features are multiplied by.
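To make the coefficient matrix easier to read, one option is to wrap svd.components_ in a DataFrame labelled with the input feature names. This is only a sketch: the `svd_0`, `svd_1`, ... row labels and the `loadings` name are made up for illustration, and the seeded random data stands in for your real frame.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

feature_names = list("ABCDEF")
rng = np.random.RandomState(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.randn(1000, len(feature_names)), columns=feature_names)

svd = TruncatedSVD(n_components=3, n_iter=7, random_state=42).fit(df)

# components_ has shape (n_components, n_features):
# one row per SVD dimension, one column per input feature.
loadings = pd.DataFrame(
    svd.components_,
    columns=feature_names,
    index=["svd_%d" % i for i in range(svd.components_.shape[0])],  # hypothetical labels
)
print(loadings.round(3))
```

Each row of `loadings` is one SVD dimension, and each entry is the weight given to the corresponding input column.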

Your original example, slightly changed:

import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD

feature_names = list('ABCDEF')
df = pd.DataFrame(
    np.random.randn(1000, len(feature_names)), 
    columns=feature_names
)

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)

svd = reduce(3)

Then you can build a more readable name for an SVD dimension from its coefficients. For example, for the 0th dimension:

" ".join([
    "%+0.3f*%s" % (coef, feat) 
    for coef, feat in zip(svd.components_[0], feature_names)
])

This gives +0.170*A -0.564*B -0.118*C +0.367*D +0.528*E +0.475*F, which you can use as a "feature name" for the 0th SVD dimension in this case. The coefficients depend on the data, so the name depends on the data too.
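If you want such names as actual column labels, the same formatting can be applied to every row of svd.components_ and attached to the transformed data. A rough sketch, assuming the same kind of random frame as above (the `dim_names` and `reduced` names are just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

feature_names = list("ABCDEF")
rng = np.random.RandomState(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.randn(1000, len(feature_names)), columns=feature_names)

svd = TruncatedSVD(n_components=3, n_iter=7, random_state=42).fit(df)

# One descriptive name per SVD dimension, built from its coefficients.
dim_names = [
    " ".join("%+0.3f*%s" % (coef, feat) for coef, feat in zip(row, feature_names))
    for row in svd.components_
]

# transform() returns a plain array, so the labels have to be attached by hand.
reduced = pd.DataFrame(svd.transform(df), columns=dim_names, index=df.index)
print(reduced.head())
```

The names get long quickly, so this is mostly useful for inspection rather than as permanent column labels.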

If you have many input dimensions, you can trade some "precision" for inspectability, e.g. sort the coefficients and keep only the top few. A more elaborate example can be found in https://github.com/TeamHG-Memex/eli5/pull/208 (disclaimer: I'm one of the eli5 maintainers; the pull request is not by me).
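A minimal sketch of that idea, assuming "top" means largest by absolute value (that choice and the `dimension_name` helper are mine, not taken from eli5). The coefficients below are copied from the output shown above so the result is easy to check by eye:

```python
import numpy as np

def dimension_name(components, feature_names, dim, top=3):
    """Name one SVD dimension by its `top` largest-magnitude coefficients."""
    coefs = components[dim]
    # Indices of the coefficients, largest absolute value first.
    order = np.argsort(-np.abs(coefs))[:top]
    return " ".join("%+0.3f*%s" % (coefs[i], feature_names[i]) for i in order)

feature_names = list("ABCDEF")
# Stand-in for svd.components_ with a single row, taken from the example output.
components = np.array([[0.170, -0.564, -0.118, 0.367, 0.528, 0.475]])

short_name = dimension_name(components, feature_names, dim=0, top=3)
print(short_name)  # -0.564*B +0.528*E +0.475*F
```

With a fitted model you would pass svd.components_ instead of the hand-written array.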