I would like to partition a variable V2 using a variable V1, where V1 itself depends on a third variable V3.
In the following R code, V1 depends on V3 such that when V3 equals 10 and V2 is 1, V1 is set to 5.
Is there an algorithm that can do this?
library(partykit)
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
# make V1 depend on V3: overwrite V1 where V3 is 10 and V2 is 1
V1[V3 == 10 & V2 == 1] <- 5
ctree(V2 ~ V1 + V3)
#ctree output :
              V1<=50
             ___|___
            |       |
          V1<=5     1
         ___|___
        |       |
      V3<=6     0
     ___|___
    |       |
  0.88    0.98
my_algorithm(V2~V1|V3)
#Expected output (optimal tree) :
   V1>50
     |
  _______
 |       |
 1     V3<10
         |
      _______
     |       |
     0       1
For instance, ctree() does not give an optimal classification (see above).
My question is probably unclear so feel free to edit it. Thank you.
I still don't fully understand the point of your question and hence probably don't have a complete answer. But I can make several comments:
(1) The situation you describe is a dependency between the regressors V1 and V3. This is not the same as V2 depending on an interaction between V1 and V3. The tree structure you show corresponds to the latter, not the former.
(2) The tree you display is not "optimal" because - due to (1) - there are still misclassifications in the second subgroup:
expected_tree <- ifelse(V1 > 50, "V1 > 50",
  ifelse(V3 < 10, "V1 <= 50 & V3 < 10", "V1 <= 50 & V3 = 10"))
split(V2, expected_tree)
## $`V1 <= 50 & V3 < 10`
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0
##
## $`V1 <= 50 & V3 = 10`
## [1] 1 1 0 1 1 0 1 0 0 1 1 1
##
## $`V1 > 50`
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1
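To make the point in (2) easier to read than the raw vectors, the same subgroups can be cross-tabulated against V2. This is only a sketch that re-creates the question's simulated data; the names tab and purity are illustrative, and no output is shown because the values drawn by sample() depend on the R version.

```r
# Re-create the simulated data from the question
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
V1[V3 == 10 & V2 == 1] <- 5

# Subgroups implied by the "expected" tree
expected_tree <- ifelse(V1 > 50, "V1 > 50",
  ifelse(V3 < 10, "V1 <= 50 & V3 < 10", "V1 <= 50 & V3 = 10"))

# Counts of V2 = 0 / V2 = 1 within each subgroup
tab <- table(subgroup = expected_tree, V2 = V2)
print(tab)

# Share of V2 = 1 per subgroup; values strictly between 0 and 1
# indicate a subgroup that still mixes both classes
purity <- tapply(V2, expected_tree, mean)
print(purity)
```

A subgroup whose share is neither 0 nor 1 cannot be classified without error by that tree, which is why the displayed tree is not "optimal".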
(3) I cannot replicate your ctree() result. It does find an interaction between V1 and V3, albeit at another cutoff in V3 - due to (2).
ctree(V2 ~ V1 + V3)
## Model formula:
## V2 ~ V1 + V3
##
## Fitted party:
## [1] root
## | [2] V1 <= 50
## | | [3] V3 <= 9: 0.000 (n = 46, err = 0.0)
## | | [4] V3 > 9: 0.667 (n = 12, err = 2.7)
## | [5] V1 > 50: 1.000 (n = 42, err = 0.0)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
Note that ctree() thinks this is a regression problem because V2 is numeric. It would probably be more appropriate to code V2 as a factor. Then ctree() will treat it as a classification problem and choose slightly different test statistics, and different printed and graphical displays.
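A minimal sketch of that suggestion, assuming the same simulated data as in the question: the data frame d and the object names ct and confusion are only illustrative. No output is shown because the draws from sample() differ between R versions.

```r
library(partykit)

# Re-create the simulated data from the question
set.seed(100)
V1 <- sample(100)
V2 <- ifelse(V1 > 50, 1, 0)
V3 <- sample(1:10, 100, replace = TRUE)
V1[V3 == 10 & V2 == 1] <- 5

# Code the response as a factor so that ctree() fits a classification tree
d <- data.frame(V1 = V1, V2 = factor(V2), V3 = V3)
ct <- ctree(V2 ~ V1 + V3, data = d)
print(ct)

# Compare observed classes with the in-sample predicted classes
confusion <- table(observed = d$V2, predicted = predict(ct, type = "response"))
print(confusion)
```

Calling plot(ct) on the fitted object then shows class distributions in the terminal nodes rather than the numeric summaries used for a regression tree.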