How to make a new DataFrame in sparkR


In SparkR I have data stored as a DataFrame. I can pull out the rows matching a single value in data like this:

newdata <- filter(data, data$column == 1)

How can I pull out the rows matching more than one value?
Say I want all rows whose value appears in the vector list <- c(1,6,10,11,14), or in list when it is a DataFrame holding 1 6 10 11 14.

newdata <- filter(data, data$column == list)

If I do it like this I get an error.


Solution

  • If you are ultimately trying to filter a Spark DataFrame by a list of unique values, you can do this with a merge operation. If you are instead going from a long to a wide data format, you need to ensure that each 'level' of the factor variable you are considering has the same number of observations. If you want to subset a Spark DataFrame by columns, you could also use a select statement, or build one up by pasting data$blah into a string and then calling eval(parse(text=bigTextObject)) on it, as @Wannes suggested. In short: code that generates a big select statement is what you want if you are filtering by column name, and a merge is what you want if you are trying to extract values from a single column.
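
    For the row-filtering case from the question, here is a minimal sketch of two options: the merge mentioned above, and %in% on a column. It assumes Spark 2.x or later (sparkR.session and createDataFrame without a context argument) and a local session. The toy data and the names wanted and wantedDF are made up for illustration and are not from the question.

    ```r
    library(SparkR)
    sparkR.session(master = "local[*]")  # assumption: local Spark 2.x+ session

    # Toy data standing in for the asker's `data` (illustrative only)
    data <- createDataFrame(data.frame(
      column = c(1, 2, 6, 7, 10, 11, 14, 20),
      value  = letters[1:8],
      stringsAsFactors = FALSE
    ))
    wanted <- c(1, 6, 10, 11, 14)

    # Option 1: %in% on a Column keeps rows whose value is in `wanted`
    viaIn <- filter(data, data$column %in% wanted)

    # Option 2: inner join (merge) against a one-column SparkDataFrame
    wantedDF <- createDataFrame(data.frame(column = wanted))
    viaMerge <- merge(data, wantedDF, by = "column")

    # collect() brings the filtered rows back as a local R data.frame
    localIn <- collect(viaIn)
    ```

    The %in% form is shorter when the values fit comfortably in an R vector. The merge form is likely the better fit when the values already live in a SparkDataFrame. Note that the merged result may carry the join column twice, with suffixes.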

    From what I understand, it seems that you want to take a big Spark DataFrame with lots of columns and keep only the ones you are interested in, as indicated by list in your question.

    Here is a little snippet that generates the Spark select statement:

    # Names of the columns to keep; backticks let non-syntactic names such as "1" parse
    list <- c(1, 2, 5, 8, 90, 200)
    listWithDataPrePended <- paste0('data$`', list, '`')
    gettingCloser <- paste0(listWithDataPrePended, collapse = ', ')
    finalSelectStatement <- paste0('select(data, ', gettingCloser, ')')
    finalData <- eval(parse(text = finalSelectStatement))
    # collect() pulls the selected columns into a local R data.frame
    finalData <- SparkR::collect(finalData)
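
    As an alternative to building the statement as text, select also accepts a character vector of column names, which avoids eval(parse(...)) entirely. This is a minimal sketch, assuming Spark 2.x or later and a local session. The toy columns a, b, c and the name keep are hypothetical.

    ```r
    library(SparkR)
    sparkR.session(master = "local[*]")  # assumption: local Spark 2.x+ session

    # Toy SparkDataFrame with three columns (illustrative only)
    data <- createDataFrame(data.frame(a = 1:3, b = 4:6, c = 7:9))

    # Hypothetical names of the columns to keep
    keep <- c("a", "c")

    # Pass the names directly to select, then collect to a local data.frame
    finalData <- collect(select(data, keep))
    ```

    If the columns are given by position rather than name, columns(data)[positions] would turn the positions into names first.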
    

    Maybe this is what you're looking for...maybe not. Nonetheless, I hope it's helpful.

    Good luck, nate