
How to remove '\' from a string in sparklyr


I am using sparklyr and have a Spark DataFrame with a column named word that contains words, some of which contain special characters that I want to remove. I was successful using regexp_replace with \\\\ before each special character, like this:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\\(', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\)', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\+', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\?', '')) %>%
  mutate(word = regexp_replace(word, '\\\\:', '')) %>%
  mutate(word = regexp_replace(word, '\\\\;', '')) %>%
  mutate(word = regexp_replace(word, '\\\\!', ''))

Now I want to remove the backslash (\). I have tried both:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\\\', ''))

and:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\', ''))

But neither works.


Solution

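  • You have to account for escaping on both the R side and the Java side. R's parser halves the typed backslashes, Spark's SQL parser (with default settings) halves them again, and the regex engine needs two to match one literal backslash, so what you actually need is "\\\\\\\\":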

    df <- copy_to(sc, tibble(word = "(abc\\zyx: 1)"))
    
    df %>% mutate(regexp_replace(word, "\\\\\\\\", ""))
    
    # Source:   lazy query [?? x 2]
    # Database: spark_shell_connection
      word            `regexp_replace(word, "\\\\\\\\\\\\\\\\", "")`
      <chr>           <chr>                                         
    1 "(abc\\zyx: 1)" (abczyx: 1)
    

    Depending on your exact requirements, it might be easier to match all unwanted characters at once. You could, for example, keep only word characters (\w) and whitespace (\s):

    df %>% mutate(regexp_replace(word, "[^\\\\w\\\\s]", ""))
    
    # Source:   lazy query [?? x 2]
    # Database: spark_shell_connection
      word            `regexp_replace(word, "[^\\\\\\\\w\\\\\\\\s]", "")`
      <chr>           <chr>                                              
    1 "(abc\\zyx: 1)" abczyx 1
    

    or keep word characters only:

    df %>% mutate(regexp_replace(word, "[^\\\\w]", ""))
    
    # Source:   lazy query [?? x 2]
    # Database: spark_shell_connection
      word            `regexp_replace(word, "[^\\\\\\\\w]", "")`
      <chr>           <chr>                                     
    1 "(abc\\zyx: 1)" abczyx1
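
To see where the eight backslashes go, the layers can be walked through in plain R without a Spark connection. This is only a rough sketch under a few assumptions: sql_unescape() is a hypothetical helper that imitates how Spark's SQL parser is expected to collapse backslash pairs with default settings, and R's PCRE engine (perl = TRUE) stands in for the Java regex engine that Spark actually uses. It also hints at why the two attempts in the question fail: '\\\\\' and '\' are not complete R strings, because the final backslash escapes the closing quote.

```r
# Layer 1: the R parser. Eight typed backslashes become four characters.
r_literal <- "\\\\\\\\"
stopifnot(nchar(r_literal) == 4)

# Layer 2 (assumption): Spark's SQL parser, with default settings, collapses
# each escaped pair of backslashes into a single one. sql_unescape() is a
# hypothetical stand-in that handles backslash pairs only.
sql_unescape <- function(s) gsub("\\\\", "\\", s, fixed = TRUE)
regex_seen_by_engine <- sql_unescape(r_literal)
stopifnot(nchar(regex_seen_by_engine) == 2)

# Layer 3: the regex engine reads the two remaining characters as one literal
# backslash. perl = TRUE (PCRE) stands in for java.util.regex here.
x <- "(abc\\zyx: 1)"   # contains a single backslash
cleaned <- gsub(regex_seen_by_engine, "", x, perl = TRUE)
print(cleaned)   # [1] "(abczyx: 1)"

# The same counting applies inside a character class, so the separate
# replacements from the question could be folded into one pattern.
class_regex <- sql_unescape("[()+?:;!\\\\\\\\]")   # what the engine would see
y <- "(a+b\\c)?: d;!"
cleaned_all <- gsub(class_regex, "", y, perl = TRUE)
print(cleaned_all)   # [1] "abc d"
```

In sparklyr the folded version would presumably be mutate(word = regexp_replace(word, "[()+?:;!\\\\\\\\]", "")), though that call is untested here since it needs a live connection. Note that inside square brackets characters such as + and ? are literals rather than quantifiers, so they need no escaping there.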