Getting back original names from rpart.object


I have saved models that were created using the rpart package in R. I am trying to retrieve some information from these saved models, specifically from the rpart.object. While the documentation (rpart doc) is helpful, there are a few things it is not clear about:

  1. How do I find out which variables are categorical and which are numeric? Currently, what I do is refer to the 'index' column in the splits matrix. I've noticed that for numeric variables only, the entry is not an integer. Is there a cleaner way to do this?
  2. The csplit matrix refers to the various values a categorical variable can take using integers, i.e. R maps the original names to integers. Is there a way to access this mapping? For example, if my original variable, say Country, can take any of the values France, Germany, Japan etc., the csplit matrix lets me know that a certain split is based on Country == 1, 2. Here, rpart has replaced references to France and Germany with 1 and 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between the names and the integers is?

Solution

  • Generally it is the terms component that would have that sort of information. See ?rpart::rpart.object.

    fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    fit$terms  # notice that the attribute dataClasses has the information
    attr(fit$terms, "dataClasses")
    #------------
     Kyphosis       Age    Number     Start 
     "factor" "numeric" "numeric" "numeric" 
    

    That example doesn't have a csplit component in its structure because none of the predictors are factors. You could make one fairly easily:

    > fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
    > fit$csplit
         [,1] [,2] [,3]
    [1,]    1    1    3
    [2,]    1    1    3
    [3,]    3    1    3
    [4,]    1    3    3
    [5,]    3    1    3
    [6,]    3    3    1
    [7,]    3    1    3
    [8,]    1    1    3
    > attr(fit$terms, "dataClasses")
                                         Kyphosis 
                                         "factor" 
                                              Age 
                                        "numeric" 
    factor(findInterval(Number, c(0, 4, 6, Inf))) 
                                         "factor" 
                                            Start 
                                        "numeric" 
    

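    As a rough sketch of how the dataClasses attribute could answer the first question when only a saved model is available: the snippet below round-trips the fit through a temporary .rds file (an assumption about how the models were saved) and sorts the predictors by class. The variable names `saved`, `categorical` and `numeric_vars` are made up for illustration. The `ncat` column of the splits matrix gives a cross-check, since ?rpart::rpart.object documents it as the number of categories for a categorical variable and +/-1 for a continuous one.

    ```r
    library(rpart)

    fit <- rpart(Kyphosis ~ Age + factor(findInterval(Number, c(0, 4, 6, Inf))) + Start,
                 data = kyphosis)

    # Mimic working from a saved model only (assumes the model was stored with saveRDS)
    path <- tempfile(fileext = ".rds")
    saveRDS(fit, path)
    saved <- readRDS(path)

    # dataClasses covers the response too, so drop it before classifying predictors
    classes    <- attr(saved$terms, "dataClasses")
    resp       <- attr(saved$terms, "response")
    predictors <- classes[-resp]

    categorical  <- names(predictors)[predictors %in% c("factor", "ordered")]
    numeric_vars <- names(predictors)[predictors == "numeric"]

    # Cross-check: variables that actually appear in a categorical split
    split_cats <- unique(rownames(saved$splits)[saved$splits[, "ncat"] > 1])

    categorical
    numeric_vars
    split_cats
    ```

    Unlike the `ncat` check, the dataClasses route also lists variables that never made it into a split.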
    The integers are just the internal codes of the factor, so the "mapping" is the same as the one between as.numeric() and levels() for that factor. Be careful with what the entries of fit$csplit mean, though: per ?rpart::rpart.object, each row is one categorical split, column j stands for the j-th level of the factor, and the entry records where that level is sent (1 = left, 3 = right, 2 = level not present at that node). So if I were trying to construct a version of the fit$csplit matrix that shows the names of the levels, labelling its columns would be one path to success:

    > kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
    > str(kyphosis)
    'data.frame':   81 obs. of  5 variables:
     $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
     $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
     $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
     $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
     $ Numlev  : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
    > fit <- rpart::rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)
    > Levels <- fit$csplit
    > colnames(Levels) <- levels(kyphosis$Numlev)
    > Levels
         low med high
    [1,]   1   1    3
    [2,]   1   1    3
    [3,]   3   1    3
    [4,]   1   3    3
    [5,]   3   1    3
    [6,]   3   3    1
    [7,]   3   1    3
    [8,]   1   1    3
    

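    To tie the rows of fit$csplit back to variable names, here is a minimal sketch that relies on the layout documented in ?rpart::rpart.object: in fit$splits, rows with ncat > 1 are categorical splits and their index column is the row number in fit$csplit, while the csplit entries 1, 2, 3 mean "goes left", "level not present at this node" and "goes right". The names `cat_rows`, `codes` and `decoded` are only illustrative, and labelling the columns this way assumes a single factor predictor, as in this example.

    ```r
    library(rpart)

    kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)),
                              labels = c("low", "med", "high"))
    fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)

    # Rows of fit$splits that describe categorical splits (ncat is +/-1 for numeric ones)
    cat_rows <- fit$splits[fit$splits[, "ncat"] > 1, , drop = FALSE]

    lev   <- attr(fit, "xlevels")$Numlev      # level names stored with the model
    codes <- c("left", "absent", "right")     # documented meaning of 1, 2, 3

    # Same shape as fit$csplit: one column per level, direction spelled out
    decoded <- matrix(codes[fit$csplit], nrow = nrow(fit$csplit),
                      dimnames = list(NULL, lev))

    # One line per categorical split: the variable and where each level is sent
    data.frame(variable = rownames(cat_rows),
               decoded[cat_rows[, "index"], , drop = FALSE],
               row.names = NULL)
    ```

    With several factor predictors the number of columns is the largest number of levels, so each row would have to be labelled with the levels of its own variable.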
    Response to comment: If you only have the model, use str() to look at it. In the example I created, the factor labels are stored in an attribute of the fitted object named "xlevels"; it is listed at the end of the str() output, right after the "ordered" component:

    $ ordered            : Named logi [1:3] FALSE FALSE FALSE
      ..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
     - attr(*, "xlevels")=List of 1
      ..$ Numlev: chr [1:3] "low" "med" "high"
     - attr(*, "ylevels")= chr [1:2] "absent" "present"
     - attr(*, "class")= chr "rpart"
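    Building on that str() output, a small sketch of how the integer-to-name mapping could be read straight off the model object. The list name `mapping` is made up for illustration; the code of a level is simply its position in the stored level vector.

    ```r
    library(rpart)

    kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)),
                              labels = c("low", "med", "high"))
    fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)

    # Levels of every factor predictor, as stored on the fitted object
    xlev <- attr(fit, "xlevels")

    # Integer code = position of the level within its factor
    mapping <- lapply(xlev, function(l) data.frame(code = seq_along(l), level = l))
    mapping$Numlev

    # Levels of the response (set for classification trees)
    attr(fit, "ylevels")
    ```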