Tags: r, filter, sample, sparklyr

Sample data after using filter or select from sparklyr


I have a large dataframe to analyse, so I'm using sparklyr to process it quickly. My goal is to take a sample of the data, but first I need to select some variables of interest and filter on the values of certain columns. I tried to select and/or filter the data and then call sample_n, but it always gives me this error:

Error in vapply(dots(...), escape_expr, character(1)) : values must be length 1, but FUN(X[[2]]) result is length 8

Below is an example of the behaviour:

library(sparklyr)
library(dplyr)

sc<-spark_connect(master='local')

data_example<-copy_to(sc,iris,'iris')

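# Keep only the columns of interest (this step works)
data_select<-select(data_example,Sepal_Length,Sepal_Width,Petal_Length)
# Sampling the selected data is what raises the error above
data_sample<-sample_n(data_select,25)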

data_sample

I don't know if I'm doing something wrong, since I only started using this package a few days ago, but I could not find a solution to this problem. Any help would be appreciated!


Solution

  • The problem seems to lie in the type of object returned when you select/mutate/filter the data. I managed to get around it by forcing the query to run and storing the result as a Spark table with compute(), and then sampling from that table.

    library(sparklyr)
    library(dplyr)
    
    sc<-spark_connect(master='local')
    
    data_example<-copy_to(sc,iris,'iris')
    
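    # Run the select in Spark and store the result as the table 'data_select'
    data_select<-data_example %>% 
      select(Sepal_Length,Sepal_Width,Petal_Length) %>% 
      compute('data_select')
    
    # Sample from the computed table instead of the lazy query
    data_sample<-sample_n(data_select,25)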
    
    data_sample
    

    Unfortunately, this approach takes a long time to run and consumes a lot of memory, so I hope to find a better solution someday.
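
If an approximate sample size is acceptable, one alternative that may be worth trying is sparklyr's sdf_sample(), which samples a fraction of the rows inside Spark. The following is only a sketch under that assumption: it is not the approach used above, it has not been benchmarked against compute(), and the target size n_target and the seed are illustrative values, not part of the original question.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = 'local')

# overwrite = TRUE only so the example can be re-run in the same session
data_example <- copy_to(sc, iris, 'iris', overwrite = TRUE)

data_select <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length)

# Hypothetical target sample size, chosen for illustration
n_target <- 25

# sdf_sample() takes a fraction rather than a row count, so convert
# the target into a fraction of the rows currently in the table
sample_fraction <- n_target / sdf_nrow(data_select)

# Sample without replacement; the number of rows returned is only
# approximately n_target, because each row is kept independently
data_sample <- data_select %>%
  sdf_sample(fraction = sample_fraction, replacement = FALSE, seed = 42) %>%
  collect()

spark_disconnect(sc)
```

Because the sample is drawn by fraction, the result will usually contain a few rows more or fewer than n_target. If exactly 25 rows are required, this sketch does not guarantee that on its own.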