Search code examples
rapache-sparkdplyrnse

Pass multiple column names in function to dplyr::distinct() with Spark


I want to specify an unknown number of column names in a function that will use dplyr::distinct(). My current attempt is:

myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.data[[id]])
}

I'm trying the above [.data[[id]]] because the data-masking section of this dplyr blog states:

When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).

and the documentation for dplyr::distinct() says about its second argument:

<data-masking> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.

Spark

More specifically, I'm trying to use this function with Spark.

sc <- sparklyr::spark_connect(local = "master")
mtcars_tbl <- sparklyr::copy_to(sc, mtcars, "mtcars_spark")

##### desired return
mtcars_tbl %>% dplyr::distinct(cyl, gear)
# Source: spark<?> [?? x 2]
    cyl  gear
  <dbl> <dbl>
1     6     4
2     4     4
3     6     3
4     8     3
5     4     3
6     4     5
7     8     5
8     6     5

##### myFunction fails
id = c("cyl", "gear")
myFunction(mtcars_tbl, id)
 Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred. 

Following this comment, I have other failed attempts:

myFunction <- function(table, id) {
    table %>%
        dplyr::distinct(.dots = id)
}

myFunction(mtcars_tbl, id)
# Source: spark<?> [?? x 1]
  .dots           
  <list>          
1 <named list [2]>


#####


myFunction <- function(table, id) {
    table %>%
        dplyr::distinct_(id)
}

myFunction(mtcars_tbl, id)
Error in UseMethod("distinct_") : 
  no applicable method for 'distinct_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Solution

  • Distinct applies to all columns of a table at once. Consider an example table:

    A     B
    1     4
    1     4
    2     3
    2     3
    3     3
    3     5
    

    It is not clear what applying distinct to only column A, but not column B should return. The following example is clearly not a good choice because it breaks the relationship between columns A and B. For example, there is no (A = 2, B = 4) row in the original dataset.

    A     B
    1     4
    2     4
    3     3
          3
          3
          5
    

    Hence the best approach is to select only those columns you want first, and then take the distinct. Something more like:

    myFunction <- function(table, id) {
      table %>%
        dplyr::select(dplyr::all_of(id)) %>%
        dplyr::distinct()
    }