python · machine-learning · decision-tree

Question regarding DecisionTreeClassifier


I am building an explainable model from past data; I am not going to use it for future prediction at all.

The data has a hundred X variables and one binary Y class, and I am trying to explain how the Xs affect Y (0 or 1).

I came up with DecisionTreeClassifier, as it clearly shows how decisions are made from the value thresholds of each variable.

Here are my questions:

  1. Is it necessary to split the data into X_train and X_test even though I am not going to predict with this model? (I do not want to waste data on a test set since I am only interpreting.)

  2. After I split the data and train the model, only a few variables get nonzero feature importance (about 3 out of 100); the rest are zero. As a result, the tree has only a few branches. I do not know why this happens.

If this is not the right place to ask such a question, please let me know.

Thanks.


Solution

    1. No, it is not necessary, but a held-out test set is a way to check whether your decision tree is overfitting, i.e. just memorizing the input values and classes rather than learning the underlying pattern. I would suggest looking into cross-validation, since it doesn't 'waste' any data: every row is used for both training and testing. If you need me to explain this further, leave a comment.
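A minimal sketch of the cross-validation idea with scikit-learn's `cross_val_score` (the data here is synthetic and purely illustrative; the shapes mimic your 100-feature, binary-target setup):

```python
# Sketch: 5-fold cross-validation, so every row is used for both
# training and testing and no data is permanently "wasted" on a test set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))          # hypothetical: 500 rows, 100 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary target driven by 2 features

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # each row is tested exactly once
print(scores.mean())
```

If the cross-validated scores are much lower than the training accuracy, the tree is likely overfitting, and its splits (your explanation) are less trustworthy.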

    2. Getting only a few important features is not an issue; it depends entirely on your data.
      Example: let's say I want a model that tells whether a number is divisible by 69 (my Y class).
      My X variables are divisibility by 2, 3, 5, 7, 9, 13, 17, 19 and 23. If I train the model correctly, only divisibility by 3 and by 23 will get high feature importance (since 69 = 3 × 23), and everything else should have very low importance.
      Consequently, my decision tree (or trees, if using ensemble models like Random Forest / XGBoost) will have fewer splits. So having only a few important features is normal and does not cause any problems.