
Sparklyr Spark SQL Filter based on multiple wildcards


Using sparklyr, I'm trying to find a way to filter a Spark dataframe with something that combines the functionality of rlike and %in%. Here is a minimal working example:

# start a Spark session in R and have dplyr loaded
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

# create a spark dataframe
df <- data.frame(names = c("Brandon", "Chee", "Brandi", "Firouz", "Eric", "Erin"),
                 place = c("Pasadena", "South Bay", "West Hollywood", "SF Valley", "South Bay", "South Bay"))
sc_df <- sdf_copy_to(sc, df, overwrite = TRUE)

# set wildcard filter parameters
f_params <- c("Brand", "Er")

# return all rows of sc_df where the 'names' value contains any of the 'f_params' values.
df_filtered <- sc_df %>% 
  filter(rlike(names, f_params)) %>% 
  collect()

The filter above obviously fails, since rlike() expects a single pattern string rather than a character vector. Ideally, the df_filtered table would look like:

print(df_filtered)
# names     place           
# Brandon   Pasadena      
# Brandi    West Hollywood
# Eric      South Bay     
# Erin      South Bay 

Additional constraint: because the real example includes about 200 values in f_params, I can't use the following solution:

df_filtered <- sc_df %>% 
  filter(rlike(names, "Brand") | rlike(names, "Er")) %>% 
  collect()

Thanks in advance.


Solution

  • I can't use multiple rlike() statements separated with | (OR) because the real example includes about 200 values in f_params

    That sounds like a rather artificial constraint, but if you really want to avoid a single regular expression, you can always compose an explicit disjunction:

    library(rlang)
    
    sc_df %>% 
      filter(!!rlang::parse_quo(glue::glue_collapse(glue::glue(
        "(names %rlike% '{f_params}')"),  # one %rlike% clause per pattern
        " %or% "  # or " | "
      ), rlang::caller_env()))
    
    # Source: spark<?> [?? x 2]
      names   place         
      <chr>   <chr>         
    1 Brandon Pasadena      
    2 Brandi  West Hollywood
    3 Eric    South Bay     
    4 Erin    South Bay 
    

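    To see what parse_quo receives, you can print the composed expression string on its own (a quick sketch, reusing the f_params defined in the question):

    glue::glue_collapse(glue::glue("(names %rlike% '{f_params}')"), " %or% ")
    # (names %rlike% 'Brand') %or% (names %rlike% 'Er')

    With ~200 values this expands into ~200 OR-ed clauses, which is exactly why the single concatenated pattern shown next is preferable when it is safe.
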
    If the f_params values are guaranteed to be valid regex literals, it should be much faster to simply concatenate them into a single pattern:

    sc_df %>% 
      filter(names %rlike% glue::glue_collapse(glue::glue("{f_params}"), "|"))
    
    # Source: spark<?> [?? x 2]
      names   place         
      <chr>   <chr>         
    1 Brandon Pasadena      
    2 Brandi  West Hollywood
    3 Eric    South Bay     
    4 Erin    South Bay  
    
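    Since glue::glue("{f_params}") just passes each value through unchanged here, a plain base R paste() is an equivalent way to build the same "Brand|Er" alternation (a minimal sketch):

    sc_df %>% 
      filter(names %rlike% paste(f_params, collapse = "|"))

    Either way Spark evaluates a single RLIKE predicate against one compiled pattern, rather than one predicate per value.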

    If not, you can apply Hmisc::escapeRegex first:

    sc_df %>% 
      filter(
        names %rlike% glue::glue_collapse(glue::glue(
          "{Hmisc::escapeRegex(f_params)}"
        ), "|")
      )
    

    but please keep in mind that Spark uses Java regular expressions, while Hmisc::escapeRegex targets R's, so it might not cover some edge cases.
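
    For example (a small illustration, assuming Hmisc is installed), an unescaped "Brand(" is not a valid pattern at all, and the dot in "Er." would match any character:

    Hmisc::escapeRegex(c("Er.", "Brand("))
    # [1] "Er\\."    "Brand\\("

    After escaping, both values only match themselves as literal strings.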