Tags: python, machine-learning, scikit-learn, decision-tree

Decision Tree not capturing the variance of the dependent variable


I am working with decision tree regressors. The dataset has 15,000 data points and 15 features. The problem I am facing is that even under heavily over-fitting settings (max depth = 25, min samples per leaf = 2), the predictions have much lower variance than the dependent variable (i.e. the model still under-fits). At first I thought this might be a bias-variance problem; however, the mean of the predictions and the mean of the dependent variable agree to 9 decimal places.

That is, it looks something like this: [image]

As a result, the predictions and the dependent variable look like this: [image]

One reason I can think of is that the features I chose might not be important at all. However, they do make sense.

Can someone please explain what might be going wrong here? Any help would be much appreciated. Thanks.


Solution

  • The details of your own data aside, this is, in principle, not surprising behavior once you understand what a decision tree is actually doing under the hood.

    What a regression tree actually returns as output is the mean value of the dependent variable y of the training samples that end up in the respective terminal nodes (leaves). Practically, this means that the output is by default discretized: the values you get at the output are among the finite set of values in the terminal nodes, without any interpolation between them whatsoever.
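
    The leaf-mean behavior can be checked directly. The sketch below is a minimal illustration on toy sine data of my own choosing (not your data), using the default squared-error criterion; it groups the training samples by the leaf they land in, via apply, and compares each leaf's prediction with the mean of the targets in that leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, assumed purely for illustration: a noisy sine curve
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

leaf_ids = tree.apply(X)   # index of the leaf each training sample ends up in
y_pred = tree.predict(X)

# Within every leaf, the prediction equals the mean of the training targets there
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean = y[in_leaf].mean()
    assert np.allclose(y_pred[in_leaf], leaf_mean)
    print(f"leaf {leaf}: n={in_leaf.sum()}, mean(y)={leaf_mean:.4f}")
```

    With max_depth=2 there are at most 4 leaves, so at most 4 distinct values can ever be predicted.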

    Given that, it should not be surprising that the variance of the predictions is lower than that of the actual values; how much lower depends on the number of terminal nodes (which is limited by, e.g., max_depth) and, of course, on the data themselves.

    The following plot from the documentation should help visualize the idea; it should be intuitively clear that the variance of the data is indeed higher than that of the (discretized) predictions:

    [image]

    Let's adapt the code from that example, adding a few more outliers (which magnify the issue):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    
    # dummy data
    rng = np.random.RandomState(1)
    X = np.sort(5 * rng.rand(80, 1), axis=0)
    y = np.sin(X).ravel()
    y[::5] += 3 * (0.5 - 5*rng.rand(16)) # modified from the docs example: the extra factor 5* makes the outliers larger
    
    estimator_1 = DecisionTreeRegressor(max_depth=2)
    estimator_1.fit(X, y)
    
    estimator_2 = DecisionTreeRegressor(max_depth=5)
    estimator_2.fit(X, y)
    
    y_pred_1 = estimator_1.predict(X)
    y_pred_2 = estimator_2.predict(X)
    

    Let's now check the variances:

    np.var(y) # true data
    # 11.238416688700267
    
    np.var(y_pred_1) # max_depth=2
    # 1.7423865989859313
    
    np.var(y_pred_2) # max_depth=5
    # 6.1398871265574595
    

    As expected, the variance of the predictions goes up with increasing tree depth, but it is still (significantly) lower than that of the true data. The mean values, on the other hand, are all the same, because each leaf predicts the mean of its own training samples:

    np.mean(y)
    # -1.2561013675900665
    
    np.mean(y_pred_1)
    # -1.2561013675900665
    
    np.mean(y_pred_2)
    # -1.2561013675900665
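
    Both observations follow from the leaf-mean property: on the training set, the variance of y splits into the variance of the predictions plus the mean squared residual (the training MSE), so the prediction variance can never exceed the variance of y. The sketch below checks this numerically on the same dummy data; it assumes the default squared-error criterion and applies to training-set predictions only, not to held-out data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same dummy data as above
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - 5 * rng.rand(16))

results = {}
for depth in (2, 5, None):   # None lets the tree grow until the leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    y_pred = tree.predict(X)
    var_pred = np.var(y_pred)
    train_mse = np.mean((y - y_pred) ** 2)   # average within-leaf variance
    results[depth] = (var_pred, train_mse)
    # var(y) = var(predictions) + training MSE, and the means coincide
    assert np.isclose(np.var(y), var_pred + train_mse)
    assert np.isclose(np.mean(y), np.mean(y_pred))
    print(f"max_depth={depth}: var(pred)={var_pred:.4f}, train MSE={train_mse:.4f}")
```

    In other words, the gap between the two variances is exactly the training error, so it only closes as the tree is allowed to fit the training data more and more closely.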
    

    All this may seem surprising to newcomers, especially if they try to "naively" extend the linear thinking of linear regression; but decision trees live in their own realm, which is certainly distinct (and rather far) from the linear one.

    To return to the discretization issue I opened the answer with, let's check how many unique values we get for our predictions, restricting the discussion to y_pred_1 for simplicity:

    np.unique(y_pred_1)
    # array([-11.74901949,  -1.9966201 ,  -0.71895532])
    

    That's it; every output you will get from that regression tree will be one of these 3 values, and never anything "between", like -10, -5.82 or [...] (i.e. no interpolation). Now, again at least intuitively, you should be able to convince yourself that the variance under such circumstances is unsurprisingly (much...) lower than that of the actual data (the predictions are, by construction, less dispersed).
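
    The same holds for inputs the tree has never seen. As a rough sketch (the dense grid X_grid is my own addition for illustration), predicting on 1000 evenly spaced points still yields no more distinct values than the tree has leaves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same dummy data as above
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - 5 * rng.rand(16))

# Hypothetical dense grid of new inputs covering the same range as X
X_grid = np.linspace(0, 5, 1000).reshape(-1, 1)

counts = {}
for depth in (2, 5):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    n_unique = np.unique(tree.predict(X_grid)).size
    n_leaves = tree.get_n_leaves()
    counts[depth] = (n_unique, n_leaves)
    # Distinct outputs are bounded by the leaf count, which is at most 2**depth
    assert n_unique <= n_leaves <= 2 ** depth
    print(f"max_depth={depth}: {n_unique} distinct predictions, {n_leaves} leaves")
```

    A deeper tree gives a finer staircase, but it remains a staircase.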