
How do you use the partysplit function from the partykit library to make a split with multiple factor levels in one child node?


I am making a manual decision tree tool in R and am having trouble with categorical splits.

For the table df below, I want to split on the variable cat1 so that levels 1, 2, and 5 (A, B, and E) go to child 1 and levels 3 and 4 (C and D) go to child 2.

Is there a way to use partysplit to specify this?

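library(partykit)

# cat1 is made an explicit factor; since R 4.0, data.frame() no longer converts strings to factors by default
df <- data.frame(cat1 = factor(rep(c('A','B','C','D','E'), times = 100)))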

# This will give 5 child nodes with one level in each node
split1 <- partysplit(varid = 1L, index = 1:5)

# This gives an error because index must contain child node IDs from 1 to the number of child nodes
split2 <- partysplit(varid = 1L, index = c(1, 2, 5))

Solution

  • For categorical variables, the easiest approach is to set index to an integer vector with one entry per factor level, where each entry is the ID of the child node that level should go to. In your case:

    split3 <- partysplit(varid = 1L, index = c(1L, 1L, 2L, 2L, 1L))
    

    The function character_split() can then be used to extract the variable name and generate suitable labels. This is convenient for checking whether you got the split right:

    character_split(split3, data = df)
    ## $name
    ## [1] "cat1"
    ## 
    ## $levels
    ## [1] "A, B, E" "C, D"
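
If you would rather name the levels than write out the index vector by hand, the same idea can be sketched as below. This is only an illustration: levels_to_index() is a hypothetical helper (not part of partykit), it assumes a two-way split, and it assumes cat1 is a factor. kidids_split() is then used to check which child node each observation is sent to.

```r
library(partykit)

df <- data.frame(cat1 = factor(rep(c("A", "B", "C", "D", "E"), times = 100)))

# Hypothetical helper (not part of partykit): returns 1L for every level
# listed in `first_child` and 2L for all remaining levels, in level order.
levels_to_index <- function(f, first_child) {
  stopifnot(is.factor(f), all(first_child %in% levels(f)))
  as.integer(ifelse(levels(f) %in% first_child, 1L, 2L))
}

idx <- levels_to_index(df$cat1, c("A", "B", "E"))
split4 <- partysplit(varid = 1L, index = idx)

# Child node ID assigned to each row of df under this split
kids <- kidids_split(split4, data = df)

# Cross-tabulate levels against child nodes to verify the assignment
counts <- table(df$cat1, kids)
```

Here idx is the same vector that was passed to split3 above, so character_split(split4, data = df) should produce the same labels.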