I am currently working in RStudio on a RHEL cluster. I use Spark 2.0.2 over a YARN client and have installed the following versions of sparklyr and dplyr:
sparklyr_0.5.4 ; dplyr_0.5.0
A simple test along the following lines results in an error:
data <- copy_to(sc, iris)
filter(data, Sepal_Length > 5)
Error in filter(data, Sepal_Length > 5) :
(list) object cannot be coerced to type 'double'
I checked the data after the copy and everything looks fine:
head(data)
Source: query [6 x 5]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Is this a known bug, and are there known fixes for it?
It's not a bug. You have to specify that you want the filter function from the dplyr package; you are probably getting the filter function from the stats package instead, which is why you get that error. You can call the right version explicitly with dplyr::filter:
res <- dplyr::filter(data, Sepal_Length > 5) %>% dplyr::collect()
head(res)
head(res)
# A tibble: 6 x 5
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 5.4 3.9 1.7 0.4 setosa
3 5.4 3.7 1.5 0.2 setosa
4 5.8 4.0 1.2 0.2 setosa
5 5.7 4.4 1.5 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
To be sure, in the RStudio console, just type filter (or any other function name) and check the popup that appears: on the right it shows which package the function will be taken from if you don't explicitly qualify the name with ::.
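You can also check this programmatically. A quick sketch using base R's environment inspection (no extra packages assumed): a function's defining namespace can be read off its environment, and search() shows the lookup order that decides which filter an unqualified call resolves to.

```r
# Which namespace defines each filter() variant?
environmentName(environment(stats::filter))   # "stats"
environmentName(environment(dplyr::filter))   # "dplyr"

# search() lists attached packages in lookup order; an unqualified call to
# filter() resolves to the first package on this path that exports it.
# If package:dplyr appears before package:stats, dplyr::filter wins.
search()
```

Because dplyr is normally attached after stats, library(dplyr) masks stats::filter, so the unqualified call usually works once dplyr is loaded; the :: qualifier simply removes any doubt.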