I am using the mob() function from the partykit package and I am running into problems when extracting information from the fitted model.
My main aim is to check approximately how large a sample size needs to be in order to detect the true structure of a data generating process (DGP) when breaks are present.
The code below performs a Monte Carlo simulation of data with breaks and tries to identify whether the break was captured by the M-fluctuation test or not.
More specifically, I want to count the number of times, over the total number of simulations (nreps), that the model actually captures the DGP, conditional on a fixed sample size (N), to get a feel for how much data I would need to capture the true DGP.
At the end of the code, you can see the list that I get out of my simulations. The problem is that I cannot recover the information displayed on the console.
Additionally, I have some doubts about how to count the "correctly identified" models. For now, my idea is to count a model as a positive if it has a break in the correct variable (z1) at the specified region (z1 > 0), with some tolerance around the break point: for example, a break at z1 > 0.1 or z1 > -0.4 would also count as a positive for me. Is there a simple way of counting the models that have these characteristics?
I hope my example is clear enough for you to help me out. Thank you a lot in advance.
library("partykit")
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
#sample size
N <- 300
#coefficients
betas <- list()
betas$b0 <- 1
betas$b1_up <- 2.4
betas$b1_down <- 2
betas$b2_up <- 2.4
betas$b2_down <- 2
#mob() ingredients
ols_formula <- y ~ x1 + x2 | z1 + z2
# the "0 +" suppresses the 'double' intercept
ols <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {lm(y ~ 0 + x)}
reg_simulation_mob <- function(...){
#data
data <- data.frame(
x1 = rnorm(N),
x2 = rnorm(N),
z1 = rnorm(N),
z2 = rnorm(N),
e = rnorm(N))
#dependent variable
data$y <- betas$b0 + with(data, ifelse(z1>0,
betas$b1_up * x1 + betas$b2_up * x2 ,
betas$b1_down * x1 + betas$b2_down * x2 )
+ e )
#Estimate mob()-OLS
ols_mob <- mob(ols_formula,
data = data,
fit = ols)
# return(ols$coefficients)
return(ols_mob)
}
# N repetitions
nreps <- 2
## Parallel version
results <- future_lapply(1:nreps, reg_simulation_mob, future.seed = 1234L)
As you can see below in the first trial (results[[1]]
), the model finds the correct break but in the second (results[[2]]
) it fails to find it.
> results
[[1]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root
| [2] z1 <= 0.00029: n = 140
| x(Intercept) xx1 xx2
| 0.9597894 1.7552122 2.1360788
| [3] z1 > 0.00029: n = 160
| x(Intercept) xx1 xx2
| 0.9371795 2.4745728 2.5087608
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 422.2329
[[2]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root: n = 300
x(Intercept) xx1 xx2
1.015224 2.175625 2.200746
Number of inner nodes: 0
Number of terminal nodes: 1
Number of parameters per node: 3
Objective function: 422.3085
In the picture below, you can observe the structure of the list results, where I cannot find the information displayed on the console (i.e. number of nodes, parameters, threshold values, etc.).
First, I would recommend using the lmtree() function rather than vanilla mob(). The former is faster, comes with better defaults for printing and plotting, and has more options for predictions.
Second, I recommend consulting vignette("partykit", package = "partykit"), which explains how party objects are constructed and which classes and methods are involved.
Third, to determine which variable (if any) was used for splitting in the root node, it is probably of interest to extract the results of all parameter instability tests. There is a dedicated sctest() (structural change test) method for obtaining this:
library("strucchange")
sctest(results[[1]], node = 1)
## z1 z2
## statistic 4.072483e+01 6.1762164
## p.value 5.953672e-07 0.9153013
sctest(results[[2]], node = 1)
## z1 z2
## statistic 12.2810548 10.1944484
## p.value 0.2165527 0.4179998
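If you want to turn these test results into an automatic decision rule for your simulation, one option (my own sketch, not a partykit function) is to pick the variable with the smallest p-value, provided it falls below the significance level; the helper name and the default alpha are my assumptions:

```r
## Hypothetical helper (not part of partykit): given a matrix shaped like
## the sctest() output above (rows "statistic" and "p.value", one column
## per partitioning variable), return the name of the selected splitting
## variable, or NA if no test is significant at the chosen level.
selected_split_var <- function(tests, alpha = 0.05) {
  p <- tests["p.value", ]
  if (all(p > alpha)) return(NA_character_)
  names(p)[which.min(p)]
}

## Example with the values printed above:
tests1 <- rbind(statistic = c(z1 = 4.072483e+01, z2 = 6.1762164),
                p.value   = c(z1 = 5.953672e-07, z2 = 0.9153013))
selected_split_var(tests1)  ## "z1"

tests2 <- rbind(statistic = c(z1 = 12.2810548, z2 = 10.1944484),
                p.value   = c(z1 = 0.2165527, z2 = 0.4179998))
selected_split_var(tests2)  ## NA
```

Applied over the whole results list, this gives you the selection frequency of each partitioning variable directly.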
The corresponding partysplit
object for the $split
(if any) in the root $node
is probably easiest to extract manually:
results[[1]]$node$split
## $varid
## [1] 4
##
## $breaks
## [1] 0.0002853492
##
## $index
## NULL
##
## $right
## [1] TRUE
##
## $prob
## NULL
##
## $info
## NULL
##
## attr(,"class")
## [1] "partysplit"
results[[2]]$node$split
## NULL
The variable id pertains to the order of the variables in:
names(results[[1]]$data)
## [1] "y" "x1" "x2" "z1" "z2"
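Putting these pieces together, the tolerance-based counting rule from the question could be sketched as follows (this is my suggestion, not a partykit function; the function name, the default tolerance of 0.5, and varid 4 for z1 given the variable order above are all assumptions):

```r
## Hypothetical counting rule: a replication counts as "correct" if the
## root split uses z1 (varid 4, given the variable order shown above) and
## the estimated break point lies within a tolerance band around the true
## break at 0.
is_correct_model <- function(tree, true_varid = 4L, true_break = 0, tol = 0.5) {
  split <- tree$node$split
  if (is.null(split)) return(FALSE)   # no split found at the root
  split$varid == true_varid &&
    abs(split$breaks - true_break) <= tol
}

## Share of correctly identified models across all replications:
## mean(sapply(results, is_correct_model))
```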
Finally, as for the question of what to evaluate: everything depends on identifying the correct variable for splitting. If this is done correctly, the split point estimates converge quickly to the true values, and then the parameter estimates converge as well. See, for example, our recent arXiv paper (https://arxiv.org/abs/1906.10179), which contains a larger simulation study and also provides replication code.
Therefore, I typically either evaluate the probability of selecting the correct variable in the first split, or look at the RMSE of the estimated vs. true coefficients for each observation.
Update: Beyond the root node, you can use nodeapply() to extract information from the various nodes. However, I do not recommend evaluating all splits, because it becomes increasingly difficult to match each estimated split to the corresponding true split. Instead, we often assess how similar the fitted partition is to the true partition, e.g., using the adjusted Rand index. Again, you can find an example of this in the arXiv paper mentioned above.
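The adjusted Rand index itself is easy to compute; here is a minimal base-R implementation (alternatively use an existing one, e.g., from the mclust package), comparing two labelings of the same observations. The usage sketch at the end, comparing the true partition (z1 > 0) with the fitted terminal nodes, is my assumption about how you would apply it to your simulation:

```r
## Minimal base-R adjusted Rand index. x and y are two labelings
## (cluster/partition assignments) of the same observations.
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  choose2 <- function(n) n * (n - 1) / 2
  sum_ij <- sum(choose2(tab))           # pairs agreeing within cells
  sum_i  <- sum(choose2(rowSums(tab)))  # pairs within clusters of x
  sum_j  <- sum(choose2(colSums(tab)))  # pairs within clusters of y
  n_pairs  <- choose2(sum(tab))
  expected <- sum_i * sum_j / n_pairs
  (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
}

## Usage sketch with a fitted tree:
## true_part   <- factor(data$z1 > 0)
## fitted_part <- factor(predict(results[[1]], newdata = data, type = "node"))
## adjusted_rand_index(true_part, fitted_part)
```

An index of 1 means the fitted partition recovers the true partition exactly (up to relabeling); values near 0 indicate agreement no better than chance.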