r, tree, partitioning, rpart, party

Conditional partitioning


I would like to partition a variable V2 using a variable V1, where V1 itself depends on a third variable V3.

In the following R code, V1 depends on V3: whenever V3 equals 10 and V2 is 1, V1 is set to 5.

Is there an algorithm that can do this?

library(partykit)
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
# Make V1 depend on V3: overwrite V1 wherever V3 is 10 and V2 is 1
V1[V3 == 10 & V2 == 1] <- 5

ctree(V2 ~ V1 + V3)
# ctree output:

      V1<=50
      ___|___  
      |     |
    V1<=5   1
   __|___
   |     |
 V3<=6   0
 ___|___  
 |      | 
0.88  0.98

# my_algorithm() is a placeholder for the algorithm I am looking for
my_algorithm(V2 ~ V1 | V3)
# Expected output (optimal tree):

 V1>50
   |
_______
|     |
1     V3<10
        |
     _______
     |     |
     0     1

For instance, ctree() does not give the optimal classification (see above).

My question is probably unclear so feel free to edit it. Thank you.


Solution

  • I still don't fully understand the point of your question and hence probably don't have a complete answer. But I can make several comments:

    (1) The situation you describe is a dependency between the regressors V1 and V3. This is not the same as V2 depending on an interaction between V1 and V3. The tree structure you show corresponds to the latter, not the former.
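
    To make the distinction in (1) concrete, here is a minimal sketch in base R that rebuilds the data from the question while keeping a copy of V1 before it is overwritten. The names V1_orig and changed are introduced here for illustration only. It checks that V2 is still a function of the original V1 alone and that only rows with V3 equal to 10 had their V1 overwritten, so V3 enters through the regressor V1 rather than through the response.

```r
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)

V1_orig <- V1                      # copy kept only for this illustration
V1[V3 == 10 & V2 == 1] <- 5        # same recoding as in the question

changed <- V1 != V1_orig           # rows whose V1 was overwritten

# V2 was generated from the original V1 only
all(V2 == ifelse(V1_orig > 50, 1, 0))
# Every overwritten row has V3 == 10
all(V3[changed] == 10)
# Cross-tabulate overwritten rows against V3 == 10
table(changed = changed, V3_is_10 = V3 == 10)
```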

    (2) The tree you display is not "optimal" because, due to (1), there are still misclassifications in the subgroup with V1 <= 50 and V3 = 10:

    expected_tree <- ifelse(V1 > 50, "V1 > 50",
      ifelse(V3 < 10, "V1 <= 50 & V3 < 10", "V1 <= 50 & V3 = 10"))
    split(V2, expected_tree)
    ## $`V1 <= 50 & V3 < 10`
    ##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    ## [39] 0 0 0 0 0 0 0 0
    ## 
    ## $`V1 <= 50 & V3 = 10`
    ##  [1] 1 1 0 1 1 0 1 0 0 1 1 1
    ## 
    ## $`V1 > 50`
    ##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    ## [39] 1 1 1 1
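
    The same check can be summarized more compactly as a cross-tabulation. This is a sketch in base R that rebuilds the data from the question; the names group and tab are introduced here for illustration.

```r
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
V1[V3 == 10 & V2 == 1] <- 5

# Leaves of the tree expected in the question
group <- ifelse(V1 > 50, "V1 > 50",
  ifelse(V3 < 10, "V1 <= 50 & V3 < 10", "V1 <= 50 & V3 = 10"))

# Counts of V2 = 0 and V2 = 1 per leaf; a pure leaf has a zero in one column
tab <- table(group, V2)
tab

# Share of V2 = 0 and V2 = 1 within each leaf
round(prop.table(tab, margin = 1), 3)
```

    By construction, the "V1 > 50" leaf can only contain V2 = 1 and the "V1 <= 50 & V3 < 10" leaf can only contain V2 = 0, so any impurity has to sit in the "V1 <= 50 & V3 = 10" leaf.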
    

    (3) I cannot replicate your ctree() result. It does find an interaction between V1 and V3, albeit at a different cutoff in V3, due to (2).

    ctree(V2 ~ V1 + V3)
    ## Model formula:
    ## V2 ~ V1 + V3
    ## 
    ## Fitted party:
    ## [1] root
    ## |   [2] V1 <= 50
    ## |   |   [3] V3 <= 9: 0.000 (n = 46, err = 0.0)
    ## |   |   [4] V3 > 9: 0.667 (n = 12, err = 2.7)
    ## |   [5] V1 > 50: 1.000 (n = 42, err = 0.0)
    ## 
    ## Number of inner nodes:    2
    ## Number of terminal nodes: 3
    

    Note that ctree() treats this as a regression problem because V2 is numeric. It would probably be more appropriate to code V2 as a factor. ctree() will then treat it as a classification problem, choose slightly different test statistics, and use different printed and graphical displays.
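
    A minimal sketch of coding the response as a factor, assuming partykit is installed. The data frame d and the objects ct and pred are names introduced here for illustration, and the exact splits may differ between R and partykit versions.

```r
library(partykit)

set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
V1[V3 == 10 & V2 == 1] <- 5

# Code the response as a factor so that ctree() fits a classification tree
d <- data.frame(V1 = V1, V2 = factor(V2, levels = c(0, 1)), V3 = V3)
ct <- ctree(V2 ~ V1 + V3, data = d)
print(ct)

# Confusion table of fitted classes against observed classes
pred <- predict(ct, newdata = d, type = "response")
table(predicted = pred, observed = d$V2)
```

    Off-diagonal counts in the confusion table are the observations that the fitted tree still misclassifies.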