Search code examples
rapache-sparkhivegsubsparklyr

change string in DF using hive command and mutate with sparklyr


Using the Hive command regexp_extract I am trying to change the following strings from:

201703170455 to 2017-03-17:04:55

and from:

2017031704555675 to 2017-03-17:04:55.0010

I am doing this in sparklyr trying to use this code that works with gsub in R:

  newdf<-df%>%mutate(Time1 = regexp_extract(Time, "(....)(..)(..)(..)(..)", "\\1-\\2-\\3:\\4:\\5"))

and this code:

newdf<-df%>mutate(TimeTrans = regexp_extract("(....)(..)(..)(..)(..)(....)", "\\1-\\2-\\3:\\4:\\5.\\6"))

but does not work at all. Any suggestions of how to do this using regexp_extract?


Solution

  • Apache Spark uses Java regular expression dialect not R, and groups should be referenced with $. Furthermore regexp_replace is used to extract a single group by a numeric index.

    You can use regexp_replace:

    df <- data.frame(time = c("201703170455", "2017031704555675"))
    sdf <- copy_to(sc, df)
    
    sdf %>% 
      mutate(time1 = regexp_replace(
        time, "^(....)(..)(..)(..)(..)$", "$1-$2-$3 $4:$5" )) %>%
      mutate(time2 = regexp_replace(
        time, "^(....)(..)(..)(..)(..)(....)$", "$1-$2-$3 $4:$5.$6"))
    

    Source:   query [2 x 3]
    Database: spark connection master=local[8] app=sparklyr local=TRUE
    
    # A tibble: 2 x 3
                  time            time1                 time2
                 <chr>            <chr>                 <chr>
    1     201703170455 2017-03-17 04:55          201703170455
    2 2017031704555675 2017031704555675 2017-03-17 04:55.5675