The following R code adds one column to the dataset and returns the result as a data.frame.
xdfAirDemo <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"))
I added a print call to check the length of the vector.
f.append <- function(lst){
    lst$mod_val_test <- rep(1, length(lst[[1]]))
    print(length(lst$mod_val_test))
    return(lst)
}
df.Airline <- rxDataStep(inData = xdfAirDemo, transformFunc = f.append)
When I run the rxDataStep call above, the print call inside f.append is executed twice and outputs two values. Can someone help me understand how rxDataStep works?
The result is shown below.
[1] 10
[1] 600000
Rows Read: 600000, Total Rows Processed: 600000, Total Chunk Time: 0.651 seconds
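When you call rxDataStep, it first runs your code on the first 10 rows of the data as a test. If this succeeds, it then processes the entire dataset one chunk at a time. That is why you see 10 followed by 600000: the first value comes from the test run, and the second from the real pass, where all 600000 rows were read as a single chunk.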
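To make the two-phase behaviour concrete, here is a rough base R sketch that mimics it on an ordinary data frame. It is only an illustration, not how RevoScaleR is implemented: simulate_data_step, add_column, the is_test argument and the chunk_size and test_rows parameters are all hypothetical names made up for this example, and the toy data stands in for the xdf file.

```r
# Hypothetical stand-in for a chunked data step; NOT RevoScaleR's implementation.
simulate_data_step <- function(data, transform_func, chunk_size = 10, test_rows = 10) {
  calls <- integer(0)  # number of rows seen on each call to transform_func

  run <- function(rows, is_test) {
    # Hand the transform a named list of column vectors, like a transformFunc gets
    chunk <- as.list(data[rows, , drop = FALSE])
    calls <<- c(calls, length(chunk[[1]]))
    transform_func(chunk, is_test)
  }

  # 1. Trial run on the first few rows; its result is thrown away
  run(seq_len(min(test_rows, nrow(data))), is_test = TRUE)

  # 2. Real pass over the whole dataset, one chunk at a time
  starts <- seq(1, nrow(data), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    as.data.frame(run(rows, is_test = FALSE))
  })

  list(result = do.call(rbind, pieces), rows_per_call = calls)
}

# Same idea as f.append, with the test flag passed in explicitly (an assumption
# of this sketch; rxDataStep exposes it as a variable instead)
add_column <- function(lst, is_test) {
  lst$mod_val_test <- rep(1, length(lst[[1]]))
  lst
}

toy <- data.frame(x = 1:25)
out <- simulate_data_step(toy, add_column, chunk_size = 10)
print(out$rows_per_call)
```

With 25 rows and a chunk size of 10, the function is called four times: once for the 10-row trial, then for chunks of 10, 10 and 5 rows. In your case the real pass had only one chunk, so you saw just two calls.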
If you don't want your code to be executed in the test run, you can check the value of the .rxIsTestChunk builtin variable:
f.append <- function(lst)
{
    # don't print anything if this is the test chunk
    if(.rxIsTestChunk)
        return(NULL)
    lst$mod_val_test <- rep(1, length(lst[[1]]))
    print(length(lst$mod_val_test))
    return(lst)
}