I have a large dataframe to analyse, so I'm using sparklyr to handle it quickly. My goal is to take a sample of the data, but first I need to select some variables of interest and filter on the values of certain columns. I tried selecting and/or filtering the data and then calling sample_n, but it always gives me this error:
Error in vapply(dots(...), escape_expr, character(1)) : values must be length 1, but FUN(X[[2]]) result is length 8
Below is an example of the behaviour:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- select(data_example, Sepal_Length, Sepal_Width, Petal_Length)
data_sample <- sample_n(data_select, 25)
data_sample
I don't know if I'm doing something wrong, since I only started using this package a few days ago, but I could not find any solution to this problem. Any help would be appreciated!
It seems to be a problem with the type of object returned when you select/mutate/filter the data. I managed to get around it by first materialising the selected data as a table in Spark with compute(), and then sampling from that table.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  compute('data_select')
data_sample <- sample_n(data_select, 25)
data_sample
Unfortunately, this approach takes a long time to run and consumes a lot of memory, so I hope to find a better solution someday.
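A possible alternative, sketched here on the assumption that an approximate sample size is acceptable, is sparklyr's sdf_sample(). It samples by fraction rather than by row count, so it returns roughly (not exactly) the requested number of rows. The fraction 25/150 and the seed below are illustrative choices for the iris example, and I have not checked whether it is faster or lighter on memory than compute() for large data.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris', overwrite = TRUE)

# Assumption: an approximate sample size is fine. sdf_sample() takes a
# fraction, so about 25 of the 150 iris rows is requested as 25 / 150.
data_sample <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  sdf_sample(fraction = 25 / 150, replacement = FALSE, seed = 42)

data_sample
```

Because the sampling is fraction-based, the number of rows returned will vary around 25 from one seed to another.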