Regex issue in regexp_replace


Problem

SparkR's regexp_replace should follow Java regex rules, but I am having a hard time matching certain symbols.

Reprex

In this reprex I manage to identify "<", "-" and "/" but not ">" or "+".

# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(test = c("<5", ">5", "3(a)", "a-a", "b+b", "c/c", "d  d", "3..3"))

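# Connect to Spark (assumption: a local Spark installation; the post does not show how sc was created)
sc <- spark_connect(master = "local")

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)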

# Modify data
df1 <- df %>%
  dplyr::mutate(
    test = regexp_replace(test, "[<]", "_"),
    test = regexp_replace(test, "[>]", "_"),
    test = regexp_replace(test, "[-]", "_"),
    test = regexp_replace(test, "[+]", "_"),
    test = regexp_replace(test, "[/]", "_"))


# Collect and print results
df2 <- df1 %>% as.data.frame()
df2

Solution

# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(test = c("<5", ">5", "3(a)", "a-a", "b+b", "c/c", "d  d", "3..3"))

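# Connect to Spark (assumption: a local Spark installation; the post does not show how sc was created)
sc <- spark_connect(master = "local")

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)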

# Modify data
df1 <- df %>%
  dplyr::mutate(
    test = regexp_replace(test, "[<>+/-]", "_"))


# Collect and print results
df2 <- df1 %>% as.data.frame()
df2
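
The character class above can be sanity-checked locally without a Spark session. This is only a minimal sketch: it uses base R's gsub() with perl = TRUE (PCRE), not Spark's Java regex engine, so it checks the pattern syntax rather than Spark's behaviour. The vector x mirrors the test column from the question.

```r
# Base R only; no Spark connection is needed for this check.
x <- c("<5", ">5", "3(a)", "a-a", "b+b", "c/c", "d  d", "3..3")

# The hyphen is placed last so it is read as a literal "-"
# and not as a range operator between two characters.
pattern <- "[<>+/-]"

# Replace every occurrence of <, >, +, / or - with an underscore.
cleaned <- gsub(pattern, "_", x, perl = TRUE)
print(cleaned)
```

Keeping the hyphen at the end (or the start) of the class avoids it being read as a range such as "+-/".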

Solution

  • I'm not sure how SparkR works, but you should be able to put all the symbols in a single character class:

    df1 <- df %>%
      dplyr::mutate(
        test = regexp_replace(test, "[<>+/-]", "_"))

    If the / causes problems, you might have to escape it:

    df1 <- df %>%
      dplyr::mutate(
        test = regexp_replace(test, "[<>+\\/-]", "_"))
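
To see what the escaped variant does, here is a small sketch, again with base R's gsub() and perl = TRUE as a stand-in (an assumption: PCRE, not Spark's Java engine). In an R string, "\\/" becomes the regex \/, which is just a literal slash inside a character class, so both patterns should give the same result on this data.

```r
# Base R only; x mirrors the test column from the question.
x <- c("<5", ">5", "3(a)", "a-a", "b+b", "c/c", "d  d", "3..3")

plain   <- "[<>+/-]"     # slash left unescaped
escaped <- "[<>+\\/-]"   # R string "\\/" is the regex \/ (an escaped slash)

res_plain   <- gsub(plain, "_", x, perl = TRUE)
res_escaped <- gsub(escaped, "_", x, perl = TRUE)

# TRUE when the escape makes no difference to the matches
same <- identical(res_plain, res_escaped)
print(same)
```

If the two results match, the escape is harmless but unnecessary for this pattern.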