apache-spark, dplyr, sparklyr

Type mismatch error for filter function with dplyr over a spark data frame


I am currently working in RStudio on a RHEL cluster. I use Spark 2.0.2 in yarn-client mode and have installed the following versions of sparklyr and dplyr:

sparklyr_0.5.4 ; dplyr_0.5.0

A simple test along the following lines results in an error:

data = copy_to(sc, iris)
filter(data , Sepal_Length >5)

Error in filter(data, Sepal_Length > 5) : 
(list) object cannot be coerced to type 'double'

I checked that the data was read correctly, and everything looks fine:

head(data)
Source:   query [6 x 5]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Is this a known bug, and are there known fixes for it?


Solution

  • It's not a bug. You need to tell R that you want the filter function from the dplyr package. You are probably calling the filter function from the stats package instead, which expects a time series rather than a Spark table, and that is why you get this error. You can select the right version explicitly with dplyr::filter:

    res <- dplyr::filter(data, Sepal_Length > 5) %>% dplyr::collect()
    head(res)
    # A tibble: 6 x 5
      Sepal_Length Sepal_Width Petal_Length Petal_Width Species
             <dbl>       <dbl>        <dbl>       <dbl>   <chr>
    1          5.1         3.5          1.4         0.2  setosa
    2          5.4         3.9          1.7         0.4  setosa
    3          5.4         3.7          1.5         0.2  setosa
    4          5.8         4.0          1.2         0.2  setosa
    5          5.7         4.4          1.5         0.4  setosa
    6          5.4         3.9          1.3         0.4  setosa
    

    To be sure, type filter (or any other function name) in the RStudio console and check the autocomplete popup that appears. On the right, it shows the package that will be used if you don't explicitly name one with ::.
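The same check can also be done in code, which helps outside RStudio or inside a script. The following is a minimal sketch using only base R; `resolved_package` is a hypothetical helper name introduced here for illustration, not part of sparklyr or dplyr. It assumes nothing about your session beyond the default attached packages.

```r
# Report which package a bare function name currently resolves to.
# `resolved_package` is a hypothetical helper, not part of any package.
resolved_package <- function(fn_name) {
  fn <- get(fn_name, mode = "function")
  environmentName(environment(fn))
}

# Typically "stats" when dplyr is not attached, and "dplyr" once it is.
filter_pkg <- resolved_package("filter")
print(filter_pkg)

# Every attached package that provides `filter`, in search-path order;
# the first entry is the one a bare call uses.
print(find("filter", mode = "function"))

# A namespace-qualified call does not depend on what is attached.
# Guarded so the sketch still runs where dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(environmentName(environment(dplyr::filter)))
}
```

If the first call reports stats, a likely cause (though not confirmed by the question) is that dplyr was never attached in the session, since loading sparklyr alone does not attach it. Either attach dplyr with library(dplyr) after the other packages, or keep using the dplyr:: prefix.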