scikit-learn python-3.5 decision-tree cross-validation confusion-matrix

Training a decision tree using id3 algorithm by sklearn

I am trying to train a decision tree using the ID3 algorithm. The goal is to get the indexes of the chosen features, to estimate the accuracy, and to build a total confusion matrix.

The algorithm should split the dataset into a training set and a test set, and use cross-validation with 4 folds.

I am new to the subject. I've read the sklearn tutorials and the theory about the learning process, but I'm still very confused.

What I've tried doing:

from sklearn import tree
from sklearn.model_selection import (cross_val_predict, KFold, cross_val_score,
                                     train_test_split, learning_curve)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix


X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train,y_train)
results = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
y_pred = cross_val_predict(estimator=clf, X=x, y=y, cv=4)
conf_mat = confusion_matrix(y,y_pred)
print(conf_mat)
dot_data = tree.export_graphviz(clf, out_file='tree.dot')

I have some questions:

How can I get a list of the feature indexes used in the training? Do I have to walk through the tree in clf? I couldn't find any API method to retrieve them.
Do I have to use 'fit', 'cross_val_score', and 'cross_val_predict'? It seems that all of them do some kind of learning process, but I couldn't manage to get the fitted clf, the accuracy, and the confusion matrix from just one of them.
Do I have to use the test set for the estimation, or the fold partitions of the dataset?

Solution

To retrieve the list of features used in the training process, you can get the column names from x (this assumes x is a pandas DataFrame):

feature_list = x.columns

As you may know, not every feature is useful for prediction. After training the model, you can see how much each one contributes using

clf.feature_importances_

The index of a feature in feature_list is the same as its index in the feature_importances_ array. A feature that was never used in a split has an importance of 0.
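To get the feature indexes that the tree actually split on, one option is to read them from the fitted tree itself. The following is only a sketch: the iris dataset and the names X_demo, y_demo, and used_features are stand-ins for illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real data (x, y) from the question.
X_demo, y_demo = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_demo, y_demo)

# tree_.feature holds the split feature index of every node;
# leaf nodes carry a negative placeholder, so keep only values >= 0.
node_features = clf.tree_.feature
used_features = np.unique(node_features[node_features >= 0])
print("Feature indexes used in splits:", used_features)

# Features with non-zero importance are always among the split features.
nonzero_importance = np.flatnonzero(clf.feature_importances_)
print("Features with non-zero importance:", nonzero_importance)
```

In most cases the two arrays agree. They can differ when a split brings no impurity decrease, so reading tree_.feature is the more direct way to list the features used.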

If you use cross-validation, retrieving the scores is less direct.
cross_val_score does the job, but cross_validate is a better way to get the scores. It works the same way as cross_val_score, but lets you retrieve several scores at once: wrap each metric you need with make_scorer and pass them all in a dictionary. Here is an example:

import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
import pandas as pd, numpy as np

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtc = DecisionTreeClassifier()
dtc_fit = dtc.fit(x_train, y_train)

def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

scoring = {
    'tp' : make_scorer(tp), 
    'tn' : make_scorer(tn), 
    'fp' : make_scorer(fp), 
    'fn' : make_scorer(fn), 
    'accuracy' : make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'f1_score' : make_scorer(f1_score),
    'recall'   : make_scorer(recall_score)
}

sc = cross_validate(dtc_fit, x_train, y_train, cv=5, scoring=scoring)

print("Accuracy: %0.2f (+/- %0.2f)" % (sc['test_accuracy'].mean(), sc['test_accuracy'].std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (sc['test_precision'].mean(), sc['test_precision'].std() * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (sc['test_f1_score'].mean(), sc['test_f1_score'].std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (sc['test_recall'].mean(), sc['test_recall'].std() * 2), "\n")

stp = math.ceil(sc['test_tp'].mean())
stn = math.ceil(sc['test_tn'].mean())
sfp = math.ceil(sc['test_fp'].mean())
sfn = math.ceil(sc['test_fn'].mean())

# Named conf_m so it does not shadow the imported confusion_matrix function
conf_m = pd.DataFrame(
    [[stn, sfp], [sfn, stp]],
    columns=['Predicted 0', 'Predicted 1'],
    index=['True 0', 'True 1']
)
print(conf_m)

When you use the cross_val functions, the function itself creates the training and test folds. If you want to manage the training and test folds yourself, you can do so with the KFold class.
If you need to preserve the class balance in each fold, which is needed for good scoring with a DecisionTreeClassifier, use StratifiedKFold. If you want randomly shuffled splits, you can use StratifiedShuffleSplit. Here is an example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
import pandas as pd, numpy as np

precision = []; recall = []; f1score = []; accuracy = []

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)    
dtc = DecisionTreeClassifier()

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    dtc.fit(X_train, y_train)
    pred = dtc.predict(X_test)

    precision.append(precision_score(y_test, pred))
    recall.append(recall_score(y_test, pred))
    f1score.append(f1_score(y_test, pred))
    accuracy.append(accuracy_score(y_test, pred))   

print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy),np.std(accuracy) * 2))
print("Precision: %0.2f (+/- %0.2f)" % (np.mean(precision),np.std(precision) * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (np.mean(f1score),np.std(f1score) * 2))
print("Recall: %0.2f (+/- %0.2f)" % (np.mean(recall),np.std(recall) * 2))
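The same loop pattern works with StratifiedKFold if you want the 4 folds and a total confusion matrix from the question. This is a minimal sketch, not the only way to do it: the built-in breast cancer dataset and the names X_demo, y_demo, and total_cm are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in binary dataset stands in for the real X and y.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
labels = np.unique(y_demo)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
total_cm = np.zeros((len(labels), len(labels)), dtype=int)

for train_index, test_index in skf.split(X_demo, y_demo):
    # Fit a fresh tree on the training part of this fold only.
    fold_clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
    fold_clf.fit(X_demo[train_index], y_demo[train_index])
    pred = fold_clf.predict(X_demo[test_index])
    # labels= keeps the matrix shape fixed even if a fold misses a class.
    total_cm += confusion_matrix(y_demo[test_index], pred, labels=labels)

print(total_cm)
```

Because every sample lands in exactly one test fold, the entries of total_cm add up to the number of samples, so the matrix covers the whole dataset rather than an average over folds.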

I hope I've answered everything you needed!