I have saved models that were created using the rpart package in R. I am trying to retrieve some information from these saved models, specifically from the rpart.object. While the documentation (rpart doc) is helpful, there are a few things it is not clear about:

Say Country can take any of the values France, Germany, Japan, etc. The csplit matrix lets me know that a certain split is based on Country == 1, 2. Here, rpart has replaced references to France, Germany with 1, 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between the names and the integers is?

Generally it is the terms component that would have that sort of information. See ?rpart::rpart.object.
fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
Kyphosis Age Number Start
"factor" "numeric" "numeric" "numeric"
That example doesn't have a csplit component in its structure because none of the predictor variables are factors. You could make one fairly easily:
> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 1 3
[3,] 3 1 3
[4,] 1 3 3
[5,] 3 1 3
[6,] 3 3 1
[7,] 3 1 3
[8,] 1 1 3
> attr(fit$terms, "dataClasses")
Kyphosis
"factor"
Age
"numeric"
factor(findInterval(Number, c(0, 4, 6, Inf)))
"factor"
Start
"numeric"
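The integers that identify factor levels are just the internal codes of the factor, so the "mapping" is the same as the one from as.numeric() to the levels() of a factor. Keep in mind that ?rpart::rpart.object describes the columns of csplit as corresponding to the factor levels and the cell values as direction codes (1 = that level goes left, 3 = it goes right, 2 = the level is not present at that node), so the substitution below illustrates the indexing mechanics rather than which levels go to which side. If I were trying to construct a character matrix version of the fit$csplit matrix that substituted the names of the levels in a factor variable, this would be one path to success: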
> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame': 81 obs. of 5 variables:
$ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
$ Age : int 71 158 128 2 1 1 61 37 113 59 ...
$ Number : int 3 3 4 5 4 2 2 3 2 6 ...
$ Start : int 5 14 5 1 15 16 17 16 16 12 ...
$ Numlev : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)
> Levels <- fit$csplit
> Levels[] <- levels(kyphosis$Numlev)[Levels]
> Levels
[,1] [,2] [,3]
[1,] "low" "low" "high"
[2,] "low" "low" "high"
[3,] "high" "low" "high"
[4,] "low" "high" "high"
[5,] "high" "low" "high"
[6,] "high" "high" "low"
[7,] "high" "low" "high"
[8,] "low" "low" "high"
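For reading the splits by column position instead, here is a minimal sketch. It assumes the layout given in ?rpart::rpart.object (each csplit column is one factor level; 1 = left, 3 = right, 2 = not present) and that the "index" column of fit$splits points at the matching csplit row for factor splits. The names ky, xlev, sp, cat_rows and decoded are only illustrative.

```r
library(rpart)

# Rebuild the example data: a three-level factor derived from Number.
ky <- kyphosis
ky$Numlev <- factor(findInterval(ky$Number, c(0, 4, 6, Inf)),
                    labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = ky)

xlev <- attr(fit, "xlevels")          # level names stored on the model itself
sp   <- fit$splits                    # primary, competitor and surrogate splits
cat_rows <- which(sp[, "ncat"] > 1)   # keep only splits on factor variables

# For each factor split, list the level names sent left and right.
decoded <- lapply(cat_rows, function(i) {
  var  <- rownames(sp)[i]             # variable used by this split
  lev  <- xlev[[var]]                 # its level names, in code order
  code <- fit$csplit[sp[i, "index"], seq_along(lev)]
  list(variable = var,
       left  = lev[code == 1],
       right = lev[code == 3])
})

str(decoded[[1]])
```

Each element of decoded names the variable and the levels going to each side, so no integer codes need to be translated by hand.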
Response to comment: If you only have the model, then use str() to look at it. In the example I created, the factor labels are stored in an attribute of the model object named "xlevels" (it is listed right after the "ordered" component near the end of the str() output):
$ ordered : Named logi [1:3] FALSE FALSE FALSE
..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
- attr(*, "xlevels")=List of 1
..$ Numlev: chr [1:3] "low" "med" "high"
- attr(*, "ylevels")= chr [1:2] "absent" "present"
- attr(*, "class")= chr "rpart"
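Given that str() output, the level names can be pulled straight from a saved model with attr(). The following is a minimal sketch that assumes the model was written with saveRDS(); the temporary file path and the names saved, xlev and mapping are hypothetical. A model stored with save() would be restored with load() instead.

```r
library(rpart)

# Recreate and save the example model (stand-in for an existing model file).
ky <- kyphosis
ky$Numlev <- factor(findInterval(ky$Number, c(0, 4, 6, Inf)),
                    labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = ky)

path <- tempfile(fileext = ".rds")    # hypothetical location of the model file
saveRDS(fit, path)

# From here on only the saved model is used, not the training data.
saved <- readRDS(path)
xlev  <- attr(saved, "xlevels")       # named list: one character vector per factor
ylev  <- attr(saved, "ylevels")       # levels of the response

# Integer code -> level name for one factor predictor.
mapping <- data.frame(code  = seq_along(xlev$Numlev),
                      level = xlev$Numlev)
print(mapping)
```

The position of each name in the xlevels vector is its integer code, which is the same relationship as between as.numeric() and levels() for a factor.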