I am looking for a way to write a function that automatically loads part of one script into another script.
Here is my problem: I have created a script that performs preprocessing on a dataset and then applies an xgboost model.
I need to automatically apply the preprocessing performed in this script (for example: creation of new variables, replacement of NAs by the mean, keeping the mean of the initial dataset) to a new dataset. This should be totally transparent for users (no copy-pasting; only a function taking the new dataset and an RData file of the model as arguments).
My idea was to "store" the preprocessing part of the script as an object in the RData file, so that when I load this object in the new script, the preprocessing is applied to the new dataset (something like the sketch below).
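Roughly what I have in mind (a rough sketch; all names here are placeholders):

```r
## in the training script: capture the preprocessing and the model together
train_means <- vapply(train_set, mean, numeric(1), na.rm = TRUE)

preprocess <- function(newdata, means) {
  ## same steps as during training: NAs replaced by the means of the initial dataset
  for (nm in names(means)) {
    newdata[[nm]][is.na(newdata[[nm]])] <- means[[nm]]
  }
  newdata
}

save(preprocess, train_means, xgb_model, file = "model.RData")

## in the new script: one load, then apply the stored preprocessing
# load("model.RData")
# predictions <- predict(xgb_model, as.matrix(preprocess(new_set, train_means)))
```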
Does anybody have an idea of how to do this?
It sounds like you're trying to implement a stable pipeline in R: save all preprocessing, transformation, and prediction steps for a big-data prediction implementation in one place. While I would currently recommend using a dedicated pipelining tool and having it call an R script instead, there are some R packages that try to provide pipelining syntax, like flowr.
As you're using xgboost, you may be able to leverage Spark ML's pipeline syntax via sparklyr as an intermediate solution, but sparklyr is still under very active development, so it may not yet work entirely as expected.
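To illustrate, a minimal sketch of what that could look like, assuming a recent sparklyr against Spark 2.2+ (the column names and formula are placeholders, and Spark's own gradient-boosted trees stand in for xgboost here):

```r
library(sparklyr)

sc <- spark_connect(master = "local")
train_tbl <- sdf_copy_to(sc, train_df, "train")   # train_df: your training data

## preprocessing and model wrapped up as one pipeline object
pipeline <- ml_pipeline(sc) %>%
  ft_imputer(input_cols = c("x1", "x2"),
             output_cols = c("x1_imp", "x2_imp"),
             strategy = "mean") %>%               # impute NAs with column means
  ft_r_formula(label ~ x1_imp + x2_imp) %>%
  ml_gbt_classifier()

fitted <- ml_fit(pipeline, train_tbl)

## the fitted pipeline (preprocessing + model) travels as a single artifact
ml_save(fitted, "my_pipeline", overwrite = TRUE)
reloaded <- ml_load(sc, "my_pipeline")
scored   <- ml_transform(reloaded, new_tbl)       # new_tbl: the new dataset
```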
The open standard for saving and sharing pipelines is PMML, and most frameworks have a way to export pipelines to PMML (R has the pmml package), but not to import them.
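For supported model types the export is a one-liner; a sketch (rpart stands in here, as xgboost support in the pmml package depends on the package version):

```r
library(pmml)    # builds on the XML package
library(rpart)

fit <- rpart(Species ~ ., data = iris)

## serialize the fitted model to the PMML open standard
XML::saveXML(pmml(fit), file = "iris_rpart.pmml")
```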
ETA: for completeness, you can also try to wrap the necessary data structures and the trained model object for each of your trained models into an S4 class and define (highly specific) preprocess(), transform(), and predict() methods, as in the sketch below. I've done this for private use, but to me it has a bit too much of a duct-tape-and-tie-wrap feel to expose it to clients.
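To give a concrete idea, a minimal sketch of such a wrapper; the class name, slots, and the mean-imputation step are invented for illustration:

```r
library(xgboost)

## wrapper holding the preprocessing parameters next to the trained model
setClass("XgbWithPrep", slots = c(
  col_means = "numeric",  # training-set means, used to impute NAs
  booster   = "ANY"       # the trained xgb.Booster
))

setGeneric("preprocess", function(object, newdata) standardGeneric("preprocess"))

setMethod("preprocess", "XgbWithPrep", function(object, newdata) {
  ## replay the training-time NA imputation on the new dataset
  for (nm in names(object@col_means)) {
    miss <- is.na(newdata[[nm]])
    newdata[[nm]][miss] <- object@col_means[[nm]]
  }
  newdata
})

setMethod("predict", "XgbWithPrep", function(object, newdata, ...) {
  clean <- preprocess(object, newdata)
  predict(object@booster, as.matrix(clean), ...)
})

## training side: capture the means and the booster in one object, save once
# col_means <- vapply(train_df, mean, numeric(1), na.rm = TRUE)
# wrapped   <- new("XgbWithPrep", col_means = col_means, booster = bst)
# save(wrapped, file = "model.RData")

## scoring side: load("model.RData"); predict(wrapped, new_df)
```

The point is that the single saved object carries everything needed to score new data.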