
How to get feature importance in Decision Tree?


I have a dataset of reviews with a positive/negative class label, and I am applying a Decision Tree to it. First, I convert the reviews into a Bag of Words representation. Here sorted_data['Text'] contains the reviews and final_counts is the resulting sparse matrix.

I split the data into train and test sets.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X_tr, X_test, y_tr, y_test = train_test_split(sorted_data['Text'], labels, test_size=0.3, random_state=0)

# BOW
count_vect = CountVectorizer()
count_vect.fit(X_tr.values)
final_counts = count_vect.transform(X_tr.values)

Then I apply the Decision Tree algorithm as follows:

from sklearn.tree import DecisionTreeClassifier

# instantiate the learning model with the optimal depth
# apply the vectorizer fitted on the train data to the test data
optimal_lambda = 15
final_counts_x_test = count_vect.transform(X_test.values)
bow_reg_optimal = DecisionTreeClassifier(max_depth=optimal_lambda, random_state=0)

# fitting the model
bow_reg_optimal.fit(final_counts, y_tr)

# predict the response
pred = bow_reg_optimal.predict(final_counts_x_test)

# evaluate accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Decision Tree for depth = %d is %f%%' % (optimal_lambda, acc))

bow_reg_optimal is a decision tree classifier. Could anyone tell me how to get the feature importances from this classifier?


Solution

  • Use the feature_importances_ attribute, which is defined once fit() has been called. For example:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.rand(1000, 2)
    y = np.random.randint(0, 5, 1000)

    tree = DecisionTreeClassifier().fit(X, y)
    tree.feature_importances_
    # e.g. array([0.51390759, 0.48609241])  -- exact values vary per run,
    # since X and y are random and no seed is set