Search code examples
rapache-sparkdplyrsparklyr

sparklyr can't filter missing value of `sd` on single value


Applying sd() to a single value in a spark data frame (via sparklyr package in R) results in a missing value that cannot be filtered out based on it being a missing value.

Can someone explain this / provide a good solution?

Example below.


library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0

x <- data.frame(grp = c("a", "a", "c"), x = c(1, 2, 3))

copy_to(sc, x, "tmp", overwrite = TRUE)
#> # Source:   table<tmp> [?? x 2]
#> # Database: spark_connection
#>     grp     x
#>   <chr> <dbl>
#> 1     a     1
#> 2     a     2
#> 3     c     3

x_tbl <- tbl(sc, "tmp") %>% group_by(grp) %>% mutate(x_sd = sd(x))

x_tbl
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#> # Groups:   grp
#>     grp     x      x_sd
#>   <chr> <dbl>     <dbl>
#> 1     a     1 0.7071068
#> 2     a     2 0.7071068
#> 3     c     3       NaN

x_tbl %>% filter(!is.na(x_sd)) %>% collect()
#> # A tibble: 3 x 3
#> # Groups:   grp [2]
#>     grp     x      x_sd
#>   <chr> <dbl>     <dbl>
#> 1     a     1 0.7071068
#> 2     a     2 0.7071068
#> 3     c     3       NaN

Solution

  • This is a matter of incompatibility between sparklyr and Spark. In Spark there are both NULLS (somewhat equivalent to R NA) and NaNs, each with different processing rules, but both values are fetched as NaN in sparklyr.

    To filter out NaN you have to use isnan (don't confuse it with R is.nan):

    x_tbl %>% filter(!isnan(x_sd)) %>% collect()
    
    # A tibble: 2 x 3
    # Groups:   grp [1]
        grp     x      x_sd
      <chr> <dbl>     <dbl>
    1     a     1 0.7071068
    2     a     2 0.7071068
    

    To better illustrate the problem:

    df <- copy_to(sc,
      data.frame(x = c("1", "NaN", "")), "df", overwrite = TRUE
    ) %>% mutate(x = as.double(x))
    
    df %>% mutate_all(funs(isnull, isnan)) 
    
    # Source:   lazy query [?? x 3]
    # Database: spark_connection
          x isnull isnan
      <dbl>  <lgl> <lgl> 
    1     1  FALSE FALSE
    2   NaN  FALSE  TRUE
    3   NaN   TRUE FALSE