I'm using scikit learn's Logistic Regression for a multiclass problem.
logit = LogisticRegression(penalty='l1')
logit = logit.fit(X, y)
I'm interested in which features are driving this decision.
logit.coef_
The above gives me a nice array of shape (n_classes, n_features), but all the class and feature names are gone. For the features that's okay, because it seems safe to assume they're indexed in the same order I passed them in...
But with classes, it's a problem, since I never explicitly passed in the classes in any order. So which class do coefficient sets (rows of the array) 0, 1, 2, and 3 belong to?
The order is the same as in logit.classes_ (classes_ is an attribute of the fitted model that holds the unique class labels found in y). The labels are stored in sorted order, which for strings means lexicographic order, regardless of the order in which they first appear in y.
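As a quick check of the ordering claim, here is a minimal sketch in which the labels are deliberately passed in non-alphabetical order. The names X_demo, y_demo and clf are made up for illustration, and the data is random, so only the label order is of interest:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.random((12, 3))  # random features, purely illustrative

# Labels first appear in reverse alphabetical order
y_demo = np.array(['WNT', 'SHH', 'GR4', 'GR3'] * 3, dtype=object)

clf = LogisticRegression().fit(X_demo, y_demo)

print(clf.classes_)       # sorted labels, not order of first appearance
print(np.unique(y_demo))  # np.unique gives the same sorted order
```

Both print statements should show the labels in the same sorted order, so comparing against np.unique(y) is a convenient way to predict the row order before fitting.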
To illustrate, fit a LogisticRegression on a random dataset with string labels y:
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.random.rand(45,5)
y = np.array(['GR3', 'GR4', 'SHH', 'GR3', 'GR4', 'SHH', 'GR4', 'SHH',
'GR4', 'WNT', 'GR3', 'GR4', 'GR3', 'SHH', 'SHH', 'GR3',
'GR4', 'SHH', 'GR4', 'GR3', 'SHH', 'GR3', 'SHH', 'GR4',
'SHH', 'GR3', 'GR4', 'GR4', 'SHH', 'GR4', 'SHH', 'GR4',
'GR3', 'GR3', 'WNT', 'SHH', 'GR4', 'SHH', 'SHH', 'GR3',
'WNT', 'GR3', 'GR4', 'GR3', 'SHH'], dtype=object)
lr = LogisticRegression()
lr.fit(X,y)
# This is what you want
lr.classes_
#Out:
# array(['GR3', 'GR4', 'SHH', 'WNT'], dtype=object)
lr.coef_
#Out:
# array of shape [n_classes, n_features]
So in the coef_ matrix, row 0 holds the coefficients for 'GR3' (the first class in the classes_ array), row 1 those for 'GR4', and so on.
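If you want the names back on the coefficients, one option is to wrap coef_ in a pandas DataFrame indexed by classes_. This is only a sketch: the feature names f0 to f4 and the variable coef_table are hypothetical, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical feature names
X = pd.DataFrame(rng.random((40, 5)), columns=feature_names)
y = np.array(['GR3', 'GR4', 'SHH', 'WNT'] * 10, dtype=object)

lr = LogisticRegression().fit(X, y)

# Rows follow lr.classes_, columns follow the order of the input features
coef_table = pd.DataFrame(lr.coef_, index=lr.classes_, columns=X.columns)

print(coef_table)
print(coef_table.loc['GR3'])  # coefficients for a single class, by name
```

Note that this relies on the multiclass case. For a binary target, coef_ has a single row of shape (1, n_features) that belongs to classes_[1], so indexing by the full classes_ array would not work there.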
Hope it helps.