r, microsoft-r

Splitting an XDF file / dataset for training and testing


Is it possible to split an .xdf file (in the Microsoft RevoScaleR context) into, say, a 75% training set and a 25% test set? I know there is a function called rxSplit(), but the documentation doesn't seem to cover this case. Most of the examples online assign a column of random numbers to the dataset and then split on that column.

Thanks. Thomas


Solution

  • You can certainly use rxSplit for this. Create a variable that defines your training and test samples, and then split on it.

    For example, using the mtcars toy dataset:

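    # Write the data frame to an .xdf file; rxDataStep returns an xdf data source
    xdf <- rxDataStep(mtcars, "mtcars.xdf")
    # Flag each row as test with probability 0.25, then split on that factor
    xdfList <- rxSplit(xdf, splitByFactor="test",
        transforms=list(test=factor(runif(.rxNumRows) < 0.25, levels=c("FALSE", "TRUE"))))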
    

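    xdfList is now a list containing 2 xdf data sources: one for test == FALSE with (approximately) 75% of the rows, and one for test == TRUE with the remaining 25% or so. The proportions are only approximate because each row is assigned independently at random.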
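
To see why the split is only approximately 75/25, the same idea can be sketched on an in-memory data frame with base R. This is an illustration under assumptions, not part of the RevoScaleR workflow: it uses split() in place of rxSplit, and the seed and the names parts, train_df and test_df are arbitrary choices made for this example.

```r
# Base R sketch of the same technique: one uniform draw per row decides
# whether the row goes to the test set (probability 0.25) or the training set.
set.seed(42)  # arbitrary seed, only so the example is reproducible

# Explicit levels guarantee both groups exist even if one ends up empty
test <- factor(runif(nrow(mtcars)) < 0.25, levels = c("FALSE", "TRUE"))

# split() returns a named list with one data frame per factor level
parts <- split(mtcars, test)
train_df <- parts[["FALSE"]]
test_df <- parts[["TRUE"]]

# Group sizes vary from run to run (without a seed) but always add up
# to the number of rows in the original data
print(c(train = nrow(train_df), test = nrow(test_df)))
```

With only 32 rows in mtcars, the observed test share can be noticeably off from 25%; on a large .xdf file the proportions should be much closer to the target.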