100% accuracy with decision tree classifier using sklearn


I'm using DecisionTreeClassifier from sklearn, but I'm getting a 100% score and I don't know what is wrong. I have tested SVM and KNN, and both give 60% to 80% accuracy, which seems OK. Here is my code:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # X_train, Y_train, X_valid and Y_valid are assumed to be defined earlier
    maxScore = 0
    index = 0
    Depths = [1, 5, 10, 20, 40]
    for i, d in enumerate(Depths):
        clf1 = DecisionTreeClassifier(max_depth=d)
        # Mean accuracy over 10-fold cross-validation on the training set
        score = cross_val_score(clf1, X_train, Y_train, cv=10).mean()
        # Keep the first depth that reaches the highest score
        if score > maxScore:
            index = i
            maxScore = score
        print('The cross val score for Decision Tree classifier (max_depth=' + str(d) + ') is ' +
              str(score))

    d = Depths[index]
    print()
    print("So the best value for max_depth parameter is " + str(d))
    print()

    # Classifying: refit with the selected depth and score on the validation set
    clf1 = DecisionTreeClassifier(max_depth=d)
    clf1.fit(X_train, Y_train)
    preds = clf1.predict(X_valid)
    print("The accuracy obtained using Decision tree classifier is {0:.8f}%".format(
        100 * clf1.score(X_valid, Y_valid)))

And here is the output:

    The cross val score for Decision Tree classifier (max_depth=1) is 1.0
    The cross val score for Decision Tree classifier (max_depth=5) is 0.9996212121212121
    The cross val score for Decision Tree classifier (max_depth=10) is 1.0
    The cross val score for Decision Tree classifier (max_depth=20) is 1.0
    The cross val score for Decision Tree classifier (max_depth=40) is 0.9996212121212121

    So the best value for max_depth parameter is 1

    The accuracy obtained using Decision tree classifier is 100.00000000%


Solution

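  • I think there's an obvious conclusion: your labels are highly correlated with some of the features, or at least with one of them. A tree with max_depth=1 makes a single split on a single feature, so a cross-validation score of 1.0 at that depth means one feature on its own is enough to separate the classes. Maybe your data isn't very good; for example, one column may encode the label directly.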

    Anyway, you can check how much each feature contributes to the splits of your decision tree, and therefore to its predictions.

    Use the feature_importances_ attribute of the fitted model (clf1.feature_importances_ in your code) to see how 'important' each feature is for the model's predictions.

    See the DecisionTreeClassifier documentation.

    If you still think the model's predictions aren't good enough, I recommend switching to a model that takes a different approach. If you have to stay with decision trees, you can at least try RandomForestClassifier.

    It is an ensemble model. The basic idea of ensemble learning is that the final prediction combines the predictions of multiple weaker models, called weak learners. Look into the main approaches to building ensemble models.

    In the case of RandomForestClassifier, the weak learners are decision trees, each trained on a bootstrap sample of the data. (In scikit-learn they are fully grown by default; set max_depth if you want shallow trees.) Each tree considers only a random subset of the features at every split. The size of that subset (max_features in scikit-learn) is a hyperparameter, so it needs to be tuned.

    Check the links and other tutorials for more information.
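
To make the checks above concrete, here is a minimal sketch. It does not use your data: X and y are a synthetic stand-in (an assumption) in which the last column almost equals the label, to mimic a feature that gives the answer away. It fits a depth-1 tree, reads feature_importances_ to find the suspect column, then re-scores a decision tree and a random forest without that column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for X_train / Y_train (assumption: not the asker's data).
n_samples = 500
y = rng.integers(0, 2, size=n_samples)
noise = rng.normal(size=(n_samples, 4))             # four uninformative features
leak = y + rng.normal(scale=0.01, size=n_samples)   # column 4 almost equals the label
X = np.column_stack([noise, leak])

# A depth-1 tree (a single split) is enough when one feature encodes the label.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
importances = stump.feature_importances_
suspect = int(np.argmax(importances))
print("feature importances:", importances)
print("most important feature index:", suspect)

# Drop the suspect column and score again. In this synthetic example the
# remaining features are pure noise, so accuracy should fall towards chance.
X_clean = np.delete(X, suspect, axis=1)
tree_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_clean, y, cv=10
).mean()

# Random forest: each split considers a random subset of the features,
# whose size is controlled by max_features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest_score = cross_val_score(forest, X_clean, y, cv=10).mean()

print("tree without suspect feature:", tree_score)
print("forest without suspect feature:", forest_score)
```

On real data, if one feature gets nearly all of the importance, check where that column comes from before trusting the 100% score. If it is a legitimate feature, the score may simply be correct.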