Search code examples
rapache-sparkdplyrsparklyr

Creating new variable by matching on strings in sparklyr


I'm using sparklyr for the first time and I'm having trouble matching strings of two vectors to create a new variable at scale. My problem has the following general structure:

I have one large dataset of urls:

df_1 <-  data.frame(
  col1 = c(1,2,3,4,5,6,7,8,9,10),
  col2 = c("john.com/abcd", "ringo.com/defg", "paul.com/hijk", "george.com/lmno", "rob.com/pqrs", "sam.com/tuvw", 
           "matt.com/xyza", "lenny.com/bcde", "bob.com/fghi", "tom.com/jklm"))
col1            col2
 1   john.com/abcd
 2  ringo.com/defg
 3   paul.com/hijk
 4 george.com/lmno
 5    rob.com/pqrs
 6    sam.com/tuvw
 7   matt.com/xyza
 8  lenny.com/bcde
 9    bob.com/fghi
10    tom.com/jklm

And another smaller dataset of general domains:

df_2 <- data.frame( 
  col1 = c(1,2,3,4,5,6,7),
  col2 = c("john.com", "jake.com", "tim.com", "paul.com", "rob.com", "harry.com", "chris.com"))
col1      col2
    1  john.com
    2  jake.com
    3   tim.com
    4  paul.com
    5   rob.com
    6 harry.com
    7 chris.com

I want to use the vector of domains in df_2 (df_2$col2) to create a dummy variable for df_1 that indicates if the domain occurs within the urls in df_1 (df_1$col_2). The resulting dataframe should look like df_3.

df_3 <- data.frame(
  col1 = c(1,2,3,4,5,6,7,8,9,10),
  col2 = c("john.com/abcd", "ringo.com/defg", "paul.com/hijk", "george.com/lmno", "rob.com/pqrs", "sam.com/tuvw", 
               "matt.com/xyza", "lenny.com/bcde", "bob.com/fghi", "tom.com/jklm"),
  col3 = c(1,0,1,0,1,0,0,0,0,0))
   col1            col2 col3
     1   john.com/abcd    1
     2  ringo.com/defg    0
     3   paul.com/hijk    1
     4 george.com/lmno    0
     5    rob.com/pqrs    1
     6    sam.com/tuvw    0
     7   matt.com/xyza    0
     8  lenny.com/bcde    0
     9    bob.com/fghi    0
    10    tom.com/jklm    0

I have read this post: How to filter on partial match using sparklyr

And have tried coding this for each individual observation of df_2 with something like,

df_3 <- df_1 %>%
  mutate(col3 = 
    ifelse(like(df_1$col2, "john.com") | df_1$col2, "jake.com" | etc.,1,0))

But so far I have been running into either stack limits or R not recognizing the like functions. There must be an easier way to do this. Thank you for any help.


Solution

  • If you're looking for a well defined prefix like here, you can extract it:

    sdf_1 <- copy_to(sc, df_1)
    sdf_2 <- copy_to(sc, df_2)
    
    sdf_1_keyed <- sdf_1 %>% mutate(key = regexp_extract(col2, "^(.*)/", 1))
    

    apply left equi-join:

    matched <- sdf_1_keyed %>% 
      left_join(sdf_2 %>% transmute(key = col2, id = col1), by="key")
    

    and summarise

    matched %>% group_by(col1, col2) %>% 
       summarise(col3 = as.numeric(sum(as.numeric(!is.na(id)), na.rm = TRUE) > 0))
    
    # Source:   lazy query [?? x 3]
    # Database: spark_connection
    # Groups:   col1
        col1 col2             col3
       <dbl> <chr>           <dbl>
     1     1 john.com/abcd       1
     2     5 rob.com/pqrs        1
     3     6 sam.com/tuvw        0
     4     9 bob.com/fghi        0
     5     3 paul.com/hijk       1
     6     4 george.com/lmno     0
     7     8 lenny.com/bcde      0
     8    10 tom.com/jklm        0
     9     2 ringo.com/defg      0
    10     7 matt.com/xyza       0
    # ... with more rows
    

    A similar thing can be done with RLIKE condition:

    candidates <- sdf_1 %>% spark_dataframe() %>% 
      sparklyr::invoke("crossJoin",
        sdf_2 %>% transmute(target = col2) %>% spark_dataframe()) %>% 
      sdf_register()
    
    candidates %>% 
        mutate(matched = as.numeric(rlike(col2, target))) %>% 
        group_by(col1, col2) %>% 
        summarise(col3 = as.numeric(sum(matched, na.rm=TRUE) > 0))
    
    # Source:   lazy query [?? x 3]
    # Database: spark_connection
    # Groups:   col1
        col1 col2             col3
       <dbl> <chr>           <dbl>
     1     1 john.com/abcd       1
     2     5 rob.com/pqrs        1
     3     6 sam.com/tuvw        0
     4     9 bob.com/fghi        0
     5     3 paul.com/hijk       1
     6     4 george.com/lmno     0
     7     8 lenny.com/bcde      0
     8    10 tom.com/jklm        0
     9     2 ringo.com/defg      0
    10     7 matt.com/xyza       0
    # ... with more rows
    

    Finally you could extract unique values:

    targets <- unique(as.character(df_2$col2))
    

    and create SQL expression:

    library(glue)
    
    expr <- glue_collapse(glue("col2 rlike '{targets}'"), " OR ")
    
    sdf_1 %>% 
      spark_dataframe() %>%
      sparklyr::invoke(
        "selectExpr", 
        list("*", as.character(glue("{expr} as col3")))) %>% 
      sdf_register() %>%
      mutate(col3 = as.numeric(col3))
    
     # Source:   lazy query [?? x 3]
     # Database: spark_connection
         col1 col2             col3
        <dbl> <chr>           <dbl>
      1     1 john.com/abcd       1
      2     2 ringo.com/defg      0
      3     3 paul.com/hijk       1
      4     4 george.com/lmno     0
      5     5 rob.com/pqrs        1
      6     6 sam.com/tuvw        0
      7     7 matt.com/xyza       0
      8     8 lenny.com/bcde      0
      9     9 bob.com/fghi        0
     10    10 tom.com/jklm        0
     # ... with more rows
    

    or R expression:

    library(rlang)
    
    rexpr <- glue_collapse(glue("rlike(col2, '{targets}')"), " | ")
    
    sdf_1 %>% mutate(col3 = !!parse_quosure(glue("as.numeric({rexpr})")))
    
    # Source:   lazy query [?? x 3]
    # Database: spark_connection
        col1 col2             col3
       <dbl> <chr>           <dbl>
     1     1 john.com/abcd       1
     2     2 ringo.com/defg      0
     3     3 paul.com/hijk       1
     4     4 george.com/lmno     0
     5     5 rob.com/pqrs        1
     6     6 sam.com/tuvw        0
     7     7 matt.com/xyza       0
     8     8 lenny.com/bcde      0
     9     9 bob.com/fghi        0
    10    10 tom.com/jklm        0
    # ... with more rows