Combining DF and rpart$where?


If I do DF$where <- tree$where after fitting an rpart object using DF as my data, will each row be mapped to its corresponding leaf through the column where?

Thanks!


Solution

  • Yes, as far as I can tell (assuming I have understood your question correctly). To demonstrate, we can work with the first example in ?rpart:

    require(rpart)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    kyphosis$where <- fit$where
    
    > str(kyphosis)
    'data.frame':   81 obs. of  5 variables:
     $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
     $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
     $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
     $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
     $ where   : int  9 7 9 9 3 3 3 3 3 8 ...
    
    > plot(fit)
    > text(fit, use.n = TRUE)
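
    The direct assignment above works because no rows of kyphosis are dropped during fitting. As a hedged sketch of the case where that does not hold: rpart's default na.action removes observations with a missing response, so fit$where can be shorter than the data frame. The example below uses a hypothetical copy named dat with one response set to NA, and assumes fit$where is named by the row names of the retained observations (which is how rpart behaves as far as I know), so the values can be aligned by name rather than by position.

    ```r
    library(rpart)

    # Hypothetical copy of kyphosis with one missing response,
    # so that rpart drops a row while fitting.
    dat <- kyphosis
    dat$Kyphosis[5] <- NA

    fit_na <- rpart(Kyphosis ~ Age + Number + Start, data = dat)

    # One element fewer than nrow(dat): positional assignment would fail here.
    length(fit_na$where)
    nrow(dat)

    # Align by row name instead; the dropped row receives NA.
    dat$where <- unname(fit_na$where[rownames(dat)])
    ```

    With complete data, as in the kyphosis example, the name-based lookup and the plain assignment should give the same column.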
    

    And now look at some tables based on the 'where' vector and some logical tests:

    First node:

    > with(kyphosis, table(where, Start >= 8.5)) 
    
    
    where FALSE TRUE
        3     0   29
        5     0   12
        7     0   14
        8     0    7
        9    19    0  # all 19 cases with Start < 8.5 have where == 9, i.e. row 9 of fit$frame
    > fit$frame[9,]
         var  n wt dev yval complexity ncompete nsurrogate   yval2.V1
    3 <leaf> 19 19   8    2       0.01        0          0  2.0000000
        yval2.V2   yval2.V3   yval2.V4   yval2.V5 yval2.nodeprob
    3  8.0000000 11.0000000  0.4210526  0.5789474      0.2345679
    

    Second node:

    > with(kyphosis, table(where, Start >= 8.5, Start>=14.5))
    , ,  = FALSE
    
    
    where FALSE TRUE
        3     0    0
        5     0   12
        7     0   14
        8     0    7
        9    19    0
    
    , ,  = TRUE
    
    
    where FALSE TRUE
        3     0   29
        5     0    0
        7     0    0
        8     0    0
        9     0    0
    

    All 29 cases with Start >= 14.5 have where == 3, and this is the corresponding row of fit$frame:

    > fit$frame[3,]
         var  n wt dev yval complexity ncompete nsurrogate   yval2.V1
    4 <leaf> 29 29   0    1       0.01        0          0  1.0000000
        yval2.V2   yval2.V3   yval2.V4   yval2.V5 yval2.nodeprob
    4 29.0000000  0.0000000  1.0000000  0.0000000      0.3580247
    

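    So I would characterize the values of fit$where as row positions in fit$frame that identify the "terminal nodes" (the rows labeled '<leaf>'), which may or may not be what you were calling the "nodes". Note that these are row positions, not node numbers: where == 9 points to the ninth row of fit$frame, whose row name (the node number) is 3.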

    > fit$frame
          var  n wt dev yval complexity ncompete nsurrogate    yval2.V1
    1   Start 81 81  17    1 0.17647059        2          1  1.00000000
    2   Start 62 62   6    1 0.01960784        2          2  1.00000000
    4  <leaf> 29 29   0    1 0.01000000        0          0  1.00000000
    5     Age 33 33   6    1 0.01960784        2          2  1.00000000
    10 <leaf> 12 12   0    1 0.01000000        0          0  1.00000000
    11    Age 21 21   6    1 0.01960784        2          0  1.00000000
    22 <leaf> 14 14   2    1 0.01000000        0          0  1.00000000
    23 <leaf>  7  7   3    2 0.01000000        0          0  2.00000000
    3  <leaf> 19 19   8    2 0.01000000        0          0  2.00000000
          yval2.V2    yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
    1  64.00000000 17.00000000  0.79012346  0.20987654     1.00000000
    2  56.00000000  6.00000000  0.90322581  0.09677419     0.76543210
    4  29.00000000  0.00000000  1.00000000  0.00000000     0.35802469
    5  27.00000000  6.00000000  0.81818182  0.18181818     0.40740741
    10 12.00000000  0.00000000  1.00000000  0.00000000     0.14814815
    11 15.00000000  6.00000000  0.71428571  0.28571429     0.25925926
    22 12.00000000  2.00000000  0.85714286  0.14285714     0.17283951
    23  3.00000000  4.00000000  0.42857143  0.57142857     0.08641975
    3   8.00000000 11.00000000  0.42105263  0.57894737     0.23456790
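
    To make the link between fit$where and fit$frame explicit, here is a minimal sketch that checks that every observation points at a '<leaf>' row and translates the row positions into node numbers. The names leaf_rows, node_id and leaf_sizes are only illustrative; the approach assumes, as the output above suggests, that the row names of fit$frame are the node numbers.

    ```r
    library(rpart)

    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

    # fit$where holds one row position in fit$frame per observation.
    leaf_rows <- fit$where

    # Every referenced row should be a terminal node.
    all(fit$frame$var[leaf_rows] == "<leaf>")

    # Translate row positions into node numbers via the row names of fit$frame.
    node_id <- as.integer(rownames(fit$frame))[leaf_rows]

    # Leaf sizes counted from the data; these should agree with fit$frame$n.
    leaf_sizes <- table(node_id)
    leaf_sizes
    fit$frame[names(leaf_sizes), "n"]
    ```

    If you would rather store node numbers than row positions in the data frame, kyphosis$node <- node_id does that in the same way as the where column.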