
Questions on rxDataStep when using 'transformFunc'


The following R code adds one column to the dataset and returns the result as a data frame.

xdfAirDemo <- RxXdfData(file.path(rxGetOption("sampleDataDir"),  "AirlineDemoSmall.xdf"))

I added a print() call to check the length of the new column:

f.append <- function(lst){
  lst$mod_val_test <- rep(1, length(lst[[1]]))
  print(length(lst$mod_val_test))
  return(lst)
}

df.Airline <- rxDataStep(inData = xdfAirDemo, transformFunc = f.append)

When I run the rxDataStep call above, the print call inside 'f.append' is executed twice and outputs two different values. Can someone help me understand how rxDataStep works?

The result is shown below:

[1] 10
[1] 600000

Rows Read: 600000, Total Rows Processed: 600000, Total Chunk Time: 0.651 seconds


Solution

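  • When you call rxDataStep, it first runs your code on the first 10 rows of the data as a test. If this succeeds, it then processes the entire dataset one chunk at a time. That is why print ran twice in your output: once for the 10-row test chunk and once for the single 600000-row chunk.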

    If you don't want your code to be executed in the test run, you can check the value of the .rxIsTestChunk builtin variable:

    f.append <- function(lst)
    {
        # always add the column, including on the test chunk
        lst$mod_val_test <- rep(1, length(lst[[1]]))

        # don't print anything if this is the test chunk
        if(!.rxIsTestChunk)
            print(length(lst$mod_val_test))

        return(lst)
    }
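
To make the "test chunk first, then real chunks" sequence concrete, here is a minimal base-R sketch that imitates the behaviour described above. It does not call RevoScaleR and is not how rxDataStep is implemented: `simulate_data_step`, its `test_rows` and `chunk_size` parameters, and the way `.rxIsTestChunk` is injected into the function's environment are all hypothetical stand-ins chosen for illustration.

```r
# Hypothetical stand-in for the behaviour described above; NOT RevoScaleR code.
simulate_data_step <- function(data, transform_func,
                               test_rows = 10, chunk_size = nrow(data)) {
  # Give the transform function an environment where .rxIsTestChunk is visible.
  env <- new.env(parent = environment(transform_func))
  environment(transform_func) <- env

  # 1. Test run on the first few rows; the result is discarded.
  env$.rxIsTestChunk <- TRUE
  transform_func(as.list(head(data, test_rows)))

  # 2. Real run: pass the data to the function one chunk at a time.
  env$.rxIsTestChunk <- FALSE
  starts <- seq(1, nrow(data), by = chunk_size)
  chunks <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    as.data.frame(transform_func(as.list(data[rows, , drop = FALSE])))
  })
  do.call(rbind, chunks)
}

# Transform function that labels each call so both kinds of run are visible.
f.append <- function(lst) {
  lst$mod_val_test <- rep(1, length(lst[[1]]))
  cat(if (.rxIsTestChunk) "test chunk:" else "data chunk:",
      length(lst$mod_val_test), "\n")
  lst
}

demo <- data.frame(x = seq_len(25))
out <- simulate_data_step(demo, f.append, test_rows = 10, chunk_size = 10)
```

With 25 rows and a chunk size of 10, the function is called four times: once on the 10-row test chunk, then on data chunks of 10, 10 and 5 rows. The returned data frame still has 25 rows, because the result of the test call is thrown away.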