
Alternative to stringr::str_detect() when working in Spark


I've worked in RStudio on a local machine for a couple of years and recently started working with Spark (version 3.0.1). I ran into an unexpected problem when I tried to use stringr::str_detect() on a Spark dataframe: apparently str_detect() has no SQL translation. I am looking for an alternative, preferably in R.

Here is an example of the result I expect from str_detect() locally, compared with what happens in Spark.

# Load packages
library(dplyr)
library(stringr)
library(sparklyr)

# Example tibble
df <- tibble(foodtype = c("potatosalad", "potato", "salad"))
df

---
# A tibble: 3 x 1
  foodtype   
  <chr>      
1 potatosalad
2 potato     
3 salad 
---

# Expected result when using R
df %>% 
  mutate(contains_potato = str_detect(foodtype, "potato"))

---
# A tibble: 3 x 2
  foodtype    contains_potato
  <chr>       <lgl>          
1 potatosalad TRUE           
2 potato      TRUE           
3 salad       FALSE  
---

But when I run the same code on a Spark dataframe, it fails with the error message "Error: str_detect() is not available in this SQL variant".

# Connect to local Spark cluster
sc <- spark_connect(master = "local", version = "3.0")

# Copy tibble to Spark cluster
df_spark <- copy_to(sc, df)
df_spark

# Error when using str_detect with Spark
df_spark %>% 
  mutate(contains_potato = str_detect(foodtype, "potato"))

---
Error: str_detect() is not available in this SQL variant
---

Solution

  • The closest Spark counterpart to str_detect() is rlike, which tests a string against a Java regular expression. I don't use Spark with R, but something like this should work:

    df_spark %>% mutate(contains_potato = foodtype %rlike% "potato")
    

    When a function has no dplyr translation, dplyr (through dbplyr) passes the call to Spark SQL unchanged, so Spark functions can also be written as ordinary R function calls:

    df_spark %>% mutate(contains_potato = rlike(foodtype, "potato"))
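A minimal end-to-end sketch of both forms, assuming the same local Spark setup as in the question. The column names contains_potato_infix and contains_potato_fn and the table name "df" are made up for illustration. show_query() prints the SQL that is sent to Spark, which is a quick way to check that rlike was passed through rather than rejected:

```r
library(dplyr)
library(sparklyr)

# Same setup as in the question: local Spark connection and a three-row table
sc <- spark_connect(master = "local", version = "3.0")
df_spark <- copy_to(
  sc,
  tibble(foodtype = c("potatosalad", "potato", "salad")),
  name = "df",          # illustrative table name
  overwrite = TRUE
)

result <- df_spark %>%
  mutate(
    # Infix form: an unknown %op% is emitted as a SQL infix operator (RLIKE)
    contains_potato_infix = foodtype %rlike% "potato",
    # Function form: an untranslated function call is passed to Spark SQL as-is
    contains_potato_fn = rlike(foodtype, "potato")
  )

# Inspect the generated SQL before running anything
show_query(result)

# Bring the (small) result back into R as a tibble
result_local <- collect(result)
result_local

spark_disconnect(sc)
```

Because rlike treats the pattern as a regular expression, characters such as "." or "(" in the search string need escaping if they are meant literally.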