Tags: apache-spark, apache-spark-sql, sparkr

SparkR - Creating Test and Train DataFrames for Data Mining


I wish to partition a SparkR DataFrame into two subsets, one for training and one for testing a GLM.

My usual way of doing this in R is to create an index vector of the rows, sample from that vector, and then subset the data based on whether each row is in the sample. For example:

set.seed(42) # of course
index <- 1:nrow(df) # sample() works on vectors
trainindex <- sample(index, trunc(length(index)/2)) # draw half the row indices
train <- df[trainindex, ] # training data set
test <- df[-trainindex, ] # all the records not in the training data set

This approach does not appear to be applicable to SparkR DataFrames, as their rows are not addressable by index the way they are in R.

As partitioning a data set is a fundamental technique in data mining, has anyone developed an approach for randomly partitioning the rows of a SparkR DataFrame?

Building on this idea: I find myself continually switching back and forth between R data.frames and Spark DataFrames as I work, and it seems undesirable to fill up memory with multiple copies of similar data frames. Does anyone have good advice on a general approach to using SparkR DataFrames for a data mining project? For example, perform all tasks up to stage X using R data.frames and then switch to Spark DataFrames? A sketch of the kind of switching I mean is shown below.
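To make that concrete, here is a minimal sketch of the round-tripping pattern, assuming an existing sqlContext and a hypothetical grouping column named "group"; createDataFrame() and collect() are the standard SparkR conversion functions, and the open question is where in the workflow to switch:

library(SparkR)

sdf <- createDataFrame(sqlContext, df)   # R data.frame -> Spark DataFrame
cache(sdf)                               # keep it materialised on the executors,
                                         # rather than duplicated in R's memory

# Do the heavy aggregation at Spark scale, then collect only the small
# reduced result back into R for local work.
counts <- collect(agg(groupBy(sdf, sdf$group), count = n(sdf$group)))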


Solution

  • I found the answer to the first part of my question (the second part is taking a little longer). For those who follow...

    sdfData <- createDataFrame(sqlContext, df)  # convert the R data.frame to a Spark DataFrame
    train <- sample(sdfData, withReplacement=FALSE, fraction=0.5, seed=42)  # ~50% of rows (fraction is approximate, not exact)
    test <- except(sdfData, train)  # everything not sampled into train
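    With the split in hand, the GLM itself can follow; a hedged sketch, assuming hypothetical columns y and x1 and the glm() method SparkR provides for DataFrames (Spark 1.5+):

    model <- glm(y ~ x1, data = train, family = "gaussian")  # fit on the training half
    preds <- predict(model, test)   # returns a DataFrame with a prediction column
    head(collect(preds))            # pull a few scored rows back into R to inspect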