r, tidyverse, disk.frame

How can I pass a single additional parameter to disk.frame's inmapfn at read-in?


According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn in csv_to_disk.frame(...) is date conversion. In my data, the name of the date column is only known at runtime, and I would like to pass that name to a function that converts the column at read-in time. The problem is that no additional parameters, beyond the chunk itself, seem to reach the function given as inmapfn. I can't hardcode the column name because it isn't known until runtime.

To clarify, the issue is that inmapfn seems to run in its own environment to prevent data races and other parallelisation issues. I know the variable won't be changed, so I am hoping there is some way to override this, since I can make sure that it is safe.

I know the function I am calling works when called on an arbitrary dataframe.

I have provided a reproducible example below.

library(tidyverse)
library(disk.frame)

setup_disk.frame()

a <- tribble(~dates, ~val,
             "09feb2021", 2,
             "21feb2012", 2,
             "09mar2013", 3,
             "20apr2021", 4,
)

write_csv(a, "a.csv")

dates_col <- "dates"

tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 1L, 
  inmapfn = function(chunk) {
    chunk[, sdate := as.Date(do.call(`$`, list(chunk,dates_col)), "%d%b%Y")]
  }
)
#>  -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#>  -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#>  -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#>  -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = `  to set column types to minimize the chance of a failed read
#> =================================================
#> 
#>  -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#> 
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#> 
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found

Solution

  • You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the user-defined inmapfn function will have access to previously defined variables. In addition, readr guesses column types and will automatically parse a column as Date if it recognizes the string format as a date (it would not recognize the format in your example data, however).

    If you don't want to use the readr backend for performance reasons, then I would ask whether your example correctly represents your actual scenario. I don't see a need to pass in the date column as a variable in the example you provided.

    There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexities between that example and yours.

    If you really need to use the default backend and chunk_reader AND you really need to pass the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:

    library(disk.frame)
    
    setup_disk.frame()
    
    df <- tribble(~dates, ~val,
                  "09feb2021", 2,
                  "21feb2012", 2,
                  "09mar2013", 3,
                  "20apr2021", 4,
    )
    
    write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)
    
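    # Wrapper that takes the column name as an argument, so inmapfn is
    # defined inside the wrapper rather than in the global environment.
    wrap_csv_to_disk <- function(col) {
      
      my_date_col <- col
      
      csv_to_disk.frame(
        file.path(tempdir(), "df.csv"), 
        in_chunk_size = 1L,
        # The column name is bound as a default argument of inmapfn.
        inmapfn = function(chunk, dates = my_date_col) {
          chunk[, dates] <- lubridate::dmy(chunk[[dates]])
          chunk
        })
    }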
    
    date_col <- "dates"
    
    df_disk_frame <- wrap_csv_to_disk(date_col)
    
    str(collect(df_disk_frame)$dates)
    #> Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
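
The wrapper above appears to work because inmapfn is created inside another function, so the column name is stored in the function's own enclosing environment instead of the global one. Below is a minimal base-R sketch of that idea, run on a plain data frame rather than a disk.frame chunk. make_date_parser is a hypothetical helper (not part of disk.frame), and the "%d%b%Y" format with English month abbreviations is an assumption taken from the example data.

```r
# Hypothetical helper, not part of disk.frame: returns a one-argument
# function that remembers the column name it was built with.
make_date_parser <- function(col, fmt = "%d%b%Y") {
  # Evaluate the arguments now so the returned closure holds their values.
  force(col)
  force(fmt)
  function(chunk) {
    # Assumes chunk behaves like a data.frame and that month abbreviations
    # match the current locale (English in the example data).
    chunk[[col]] <- as.Date(chunk[[col]], format = fmt)
    chunk
  }
}

# Column name only known at runtime.
date_col <- "dates"

chunk <- data.frame(
  dates = c("09feb2021", "21feb2012"),
  val = c(2, 2),
  stringsAsFactors = FALSE
)

parse_dates <- make_date_parser(date_col)
parsed <- parse_dates(chunk)
str(parsed$dates)
```

Because the returned function takes only the chunk, it has the shape inmapfn expects, so something like inmapfn = make_date_parser(date_col) may be worth trying in place of the wrapper. That combination is untested here.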