Search code examples
rtidymodelsr-parsnipbart

Using tidymodels in R, my BART workflow changes after I fit it once. Why?


I have been trying to train a BART model using the tidymodels framework but I am running into some problems.

I can declare the model, the recipe, and the workflow alright, but once I fit the workflow, two unwanted things happen:

  1. The original model object (bart_mod below), initially correctly stored, becomes "call: NULL", even though I don't touch the model object directly (I assign nothing to the same object name).

  2. I am not able to retrieve any information about the fitted model. The bart_fit contains nothing and there seems to be no tidy method associated to it. All this is true even though I am able to predict values using the fitted model! (See last line of code in the reprex).

This may very well come from a misunderstanding of how all this works on my end, I am fairly new to tidymodels.

I would appreciate any help! Thank you.

library(tidyverse)
library(tidymodels)

set.seed(2022)

# Parameters --------------------------------------------------------------

n <- 5000

coef_x_var_1 <- 1
coef_x_var_2 <- 2
coef_x_var_3 <- 3

gen_y_1 <- function(data = dataset) {
  
  return(data$y_0 +
           data$x_var_1*coef_x_var_1 +
           data$x_var_2*coef_x_var_2 +
           data$x_var_3*coef_x_var_3 +
           rnorm(n = nrow(data), mean = 0, sd = 3)
  )}

# Data generation ---------------------------------------------------------

dataset <- matrix(NA, nrow = n, ncol = 3)

# Generate the unit-level moderators
dataset[,1] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,2] <- rnorm(mean = rnorm(n = 1), n = n)
dataset[,3] <- rnorm(mean = rnorm(n = 1), n = n)

# Change into dataframe
colnames(dataset) <- c("x_var_1", "x_var_2", "x_var_3")
dataset <- as_tibble(dataset)

# Make sure the variable format is numeric (except for the identifiers)
dataset$x_var_1 <- as.numeric(dataset$x_var_1)
dataset$x_var_2 <- as.numeric(dataset$x_var_2)
dataset$x_var_3 <- as.numeric(dataset$x_var_3)

# Generate the untreated potential outcomes
P0_coefs <- rdunif(n = 6,  1, 15)
dataset$y_0 <-
  dataset$x_var_1*P0_coefs[4] +
  dataset$x_var_2*P0_coefs[5] +
  dataset$x_var_3*P0_coefs[6] +
  rnorm(n = nrow(dataset), mean = 0, sd = 3)

dataset$y_1 <- gen_y_1(data = dataset)

# Create a variable to indicate treatment
treatment_group <- sample(1:nrow(dataset), size = nrow(dataset)/2)
# Indicate which potential outcome you observe
obs_dataset <- dataset |> 
  mutate(treated = ifelse(row_number() %in% treatment_group, 1, 0),
         obs_y = ifelse(treated, y_1, y_0))

y1_obs_dataset <- obs_dataset |> filter(treated == 1)
y0_obs_dataset <- obs_dataset |> filter(treated == 0)

# Analysis ----------------------------------------------------------------

covariates <- c("x_var_1", "x_var_2", "x_var_3")
bart_formula <- as.formula(paste0("obs_y ~ ", paste(covariates, collapse = " + ")))

# Create the workflow
bart_mod <- bart() |> 
  set_engine("dbarts") |> 
  set_mode("regression")

bart_recipe <- recipe(bart_formula, data = obs_dataset) |> 
  step_zv(all_predictors())

bart_workflow <- 
  workflow() |> 
  add_model(bart_mod) |> 
  add_recipe(bart_recipe)

# The workflow first looks right
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_zv()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> BART Model Specification (regression)
#> 
#> Computational engine: dbarts

# Once I fit it though, the model part becomes call: NULL
bart_fit <- bart_workflow |> 
  fit(y1_obs_dataset)

# Nothing is stored in the fit
bart_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_zv()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> `NULL`()

# The content of this object has changed!
bart_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: bart()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_zv()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> NULL

bart_fit |>
  extract_fit_parsnip(bart_fit)
#> parsnip model object
#> 
#> 
#> Call:
#> `NULL`()

# And yet, I am able to run a prediction using the fit!
predict(bart_fit, y0_obs_dataset)
#> # A tibble: 2,500 × 1
#>      .pred
#>      <dbl>
#>  1  -4.67 
#>  2  -6.23 
#>  3   6.35 
#>  4  10.7  
#>  5   4.90 
#>  6 -13.8  
#>  7   4.70 
#>  8  19.6  
#>  9  -0.907
#> 10   5.38 
#> # … with 2,490 more rows

Created on 2022-12-24 with reprex v2.0.2


Solution

  • First stripping Martin's code down to a smaller script:

    library(tidyverse)
    library(tidymodels)
    set.seed(2022)
    
    obs_dataset <- structure(list(x_var_1 = c(-0.273203786163623, 0.0026566250757164, 
                                              -0.544359413888551, 0.569128408034224, -2.00048700105319, -0.159113741655834
    ), obs_y = c(-8.14952415680873, 1.91364235165124, -7.68391811408719, 
                 -9.01497463720505, -18.5017189874949, -13.505685812581)), row.names = c(NA, 
                                                                                         -6L), class = c("tbl_df", "tbl", "data.frame"))
    bart_formula <- as.formula("obs_y ~ x_var_1")
    
    # Create the workflow
    bart_mod <- bart() |> 
      set_engine("dbarts") |> 
      set_mode("regression")
    
    bart_recipe <- recipe(bart_formula, data = obs_dataset)
    
    bart_workflow <- 
      workflow() |> 
      add_model(bart_mod) |> 
      add_recipe(bart_recipe)
    

    The workflow at first looks right

    bart_workflow
    
    > ══ Workflow
    > ════════════════════════════════════════════════════════════════
    > Preprocessor: Recipe Model: bart()
    > 
    > ── Preprocessor
    > ────────────────────────── 0 Recipe Steps
    > 
    > ── Model
    > ─────────────────────────────────────────────────────────
    > BART Model Specification (regression)
    > 
    > Computational engine: dbarts
    

    but this changes after fitting:

    bart_fit <- bart_workflow |> 
      fit(obs_dataset)
    bart_fit
    

    The workflow now displays NULL for the call, as does the model object.

    bart_workflow
    bart_mod
    
    ══ Workflow [trained] ══════════════════════════════════════════════════════
    Preprocessor: Recipe
    Model: bart()
    
    ── Preprocessor ─────────────────────────────────
    0 Recipe Steps
    
    ── Model ────────────────────────────────────────────────
    
    Call:
    `NULL`()
    

    All these display values:

    required_pkgs(bart_mod)
    print_model_spec(bart_mod)
    bart_mod[["engine"]]
    bart_mod[["mode"]]
    
    extract_recipe(bart_fit)
    extract_preprocessor(bart_fit)
    extract_mold(bart_fit)
    bart_fit[["fit"]][["fit"]][["spec"]][["engine"]]
    bart_fit[["fit"]][["fit"]][["spec"]][["mode"]]
    

    These display NULL:

    print(bart_mod)
    print(bart_workflow)
    print(bart_fit)
    extract_fit_engine(bart_fit)
    extract_fit_parsnip(bart_fit)
    extract_model(bart_fit)
    

    So, it seems that the model data is still in the objects, and is useable, but the print calls do not display it, and the extract functions do not display it.