Using sparklyr, I'm trying to filter a Spark dataframe by combining rlike
and %in%
functionality. Here is a minimal working example:
# start a Spark session in R and have dplyr loaded
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

# create a Spark dataframe
df <- data.frame(
  names = c("Brandon", "Chee", "Brandi", "Firouz", "Eric", "Erin"),
  place = c("Pasadena", "South Bay", "West Hollywood", "SF Valley", "South Bay", "South Bay")
)
sc_df <- sdf_copy_to(sc, df, overwrite = TRUE)
# set wildcard filter parameters
f_params <- c("Brand", "Er")

# return all rows of sc_df where the 'names' value contains any of the f_params values
df_filtered <- sc_df %>%
  filter(rlike(names, f_params)) %>%
  collect()
df_filtered
The filter() call in the code above obviously fails, since rlike expects a single pattern string rather than a character vector. Ideally, the df_filtered
table would look like:
print(df_filtered)
# names   place
# Brandon Pasadena
# Brandi  West Hollywood
# Eric    South Bay
# Erin    South Bay
Additional rule: because the real example includes about 200 values in f_params
, I can't use the following solution:
df_filtered <- sc_df %>%
  filter(rlike(names, "Brand") | rlike(names, "Er")) %>%
  collect()
Thanks in advance.
I can't use multiple rlike() statements separated with | (OR) because the real example includes about 200 values in f_params
That sounds like a rather artificial constraint, but if you really want to avoid a single regular expression, you can always compose an explicit disjunction:
library(rlang)

sc_df %>%
  filter(!!rlang::parse_quo(glue::glue_collapse(
    glue::glue("(names %rlike% '{f_params}')"),
    " %or% "  # or " | "
  ), rlang::caller_env()))
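To see what that pipeline actually builds, here is a pure-R sketch (no Spark connection needed) of the expression string that parse_quo() receives before it is turned into a quosure and spliced into filter():

```r
library(glue)

f_params <- c("Brand", "Er")

# glue() expands f_params into one clause per pattern,
# glue_collapse() joins the clauses with the %or% operator
expr_str <- glue_collapse(
  glue("(names %rlike% '{f_params}')"),
  " %or% "
)
expr_str
# (names %rlike% 'Brand') %or% (names %rlike% 'Er')
```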
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay
If the f_params
values are guaranteed to be valid regexp literals, it should be much faster to simply concatenate the strings:
sc_df %>%
  filter(names %rlike% glue::glue_collapse(glue::glue("{f_params}"), "|"))
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay
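As a quick sanity check in plain R (no Spark needed), the collapsed pattern is just a regex alternation, which base R's paste() reproduces without any extra packages:

```r
f_params <- c("Brand", "Er")

# base-R equivalent of glue::glue_collapse(f_params, "|")
pattern <- paste(f_params, collapse = "|")
pattern
# "Brand|Er"

# the same alternation works locally with grepl()
grepl(pattern, c("Brandon", "Chee", "Eric"))
# TRUE FALSE TRUE
```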
If not, you can try applying Hmisc::escapeRegex
first:
sc_df %>%
  filter(
    names %rlike% glue::glue_collapse(glue::glue(
      "{Hmisc::escapeRegex(f_params)}"
    ), "|")
  )
but please keep in mind that Spark uses Java regular expressions, so the escaping might not cover some edge cases.
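To illustrate the escaping step without pulling in Hmisc, here is a minimal hand-rolled sketch of the same idea; escape_rx is a hypothetical helper that backslash-escapes common regex metacharacters, not the Hmisc implementation:

```r
# escape_rx: prefix regex metacharacters with a backslash so each
# f_params entry matches literally inside the alternation
# (hypothetical helper, a simplified stand-in for Hmisc::escapeRegex)
escape_rx <- function(x) {
  gsub("([][{}()|^$.*+?\\\\])", "\\\\\\1", x, perl = TRUE)
}

escape_rx("a.b")
# "a\\.b" -- the dot is now matched literally

# an escaped pattern matches its own literal text
grepl(escape_rx("C++"), "C++")
# TRUE
```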