Tags: r, torch, targets-r-package

In R targets, cannot read target object of class "dataset"


I am struggling to get the R packages torch and targets to work together. For example, if I define a target of class dataset (from torch), I cannot read it with tar_read (from targets), and I cannot use it in other targets.

Here is my dataset generator nn_dataset:

library(torch)
library(targets)
library(dplyr)
library(tidymodels)

nn_dataset <- 
  dataset(
    name = "nn_dataset",
    
    initialize = function(df) {
      data <- self$prepare_data(df)
      
      self$tele <- data$x$tele
      self$class <- data$x$class
      self$y <- data$y
    },
    
    .getitem = function(i) {
      list(
        x = list(
          tele = self$tele[i, ], 
          class = self$class[i, ]
        ), 
        y = self$y[i, ]
      )
    },
    
    .length = function() {
      self$y$size()[[1]]
    },
    
    prepare_data = function(df) {
      target_col <- 
        df$claim_ind_cov_1_2_3_4_5_6 %>% 
        as.integer() %>%
        `-`(1) %>%
        as.matrix()
      
      tele_cols <- 
        df %>%
        select(starts_with(c("h_", "p_", "vmo", "vma"))) %>%
        as.matrix()
    
      class_df <- select(df, expo:years_licensed, distance)
      
      rec_class <-
        recipe(~ ., data = class_df) %>%
        step_impute_median(commute_distance, years_claim_free) %>%
        step_other(all_nominal(), threshold = 0.05) %>%
        step_dummy(all_nominal()) %>%
        prep()

      class_cols <- juice(rec_class) %>% as.matrix()
      
      list(
        x = list(
          tele = torch_tensor(tele_cols),
          class = torch_tensor(class_cols)
        ),
        y = torch_tensor(target_col)
      )
    }
)

If I define the following target:

tar_target(
  name = target_name,
  command = nn_dataset(valid_df)
)

where valid_df is a tibble, and if I then try to read it:

tar_read(target_name)

then I get this error:

Error in cpp_tensor_dim(self$ptr) : external pointer is not valid

I have also tried this:

tar_target(
  name = target_name,
  command = nn_dataset(valid_df),
  format = "torch"  
)

and this:

tar_torch(
  name = target_name,
  command = nn_dataset(valid_df)
)

but neither worked.


Solution

  • The format = "torch" option in targets relies on torch::torch_save() and torch::torch_load(), and those functions do not work on custom R6 objects such as the one returned by nn_dataset(valid_df) in your example. On top of that, torch data is "non-exportable": its tensors are held through external pointers, which is why you see "external pointer is not valid" after reloading. As discussed at https://books.ropensci.org/targets/targets.html#saving and https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html, such data cannot simply be saved to disk with saveRDS(), which is what targets uses by default. I do not know torch well enough to recommend something specific, but a solution would require working out R code that safely saves and loads one of these objects, then creating your own custom storage format with tar_format(). The examples at https://docs.ropensci.org/targets/reference/tar_format.html#ref-examples include one for Keras models.

    A better alternative is to avoid saving R6 objects altogether, because they carry code as well as data and do not hash well. If you can restructure the pipeline to save simpler versions of the data and re-create the R6 objects only when they are needed, that would be much better, especially if they take almost no time to create from, e.g., a data frame. So your first target could be the valid_df data frame, and the model-fitting target could then call nn_dataset(valid_df), fit the model, and return easy-to-save output generated from the fitted model.
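
A minimal sketch of that restructuring, as a _targets.R file. It assumes the nn_dataset() generator from the question is defined earlier in the same file or sourced from another script. The names load_valid_df() and summarise_dataset() are hypothetical placeholders, and the model-fitting step is left out because the question does not show it.

```r
# _targets.R: a sketch under the assumptions above, not a tested pipeline.
library(targets)

# Packages attached before each target runs.
tar_option_set(packages = c("torch", "dplyr", "tidymodels"))

# Hypothetical helper: builds the torch dataset inside the target and
# returns only plain R data, which the default saveRDS() storage can handle.
summarise_dataset <- function(df) {
  ds <- nn_dataset(df)  # R6 object backed by external pointers; never stored
  # ... fit the model on ds here (not shown in the question) ...
  data.frame(n_obs = length(ds))  # length() calls the dataset's .length()
}

list(
  # Hypothetical loader; replace with however valid_df is really produced.
  tar_target(name = valid_df, command = load_valid_df()),
  # The dataset lives only inside this target; its return value is a data frame.
  tar_target(name = valid_summary, command = summarise_dataset(valid_df))
)
```

With this layout, tar_read(valid_df) and tar_read(valid_summary) both return ordinary data frames, and the torch dataset is rebuilt from valid_df whenever a downstream target needs it.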