
Regex to convert time equations to R date-time (POSIXct)


I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:

* = current time
t = current day (00:00)
mo = month 
d = days 
h = hours
m = minutes 

For example, *-3d is the current time minus 3 days, and t-3h is three hours before midnight today (i.e. 21:00 yesterday).

I'd like to be able to ingest these equations into R and get the corresponding POSIXct value. I'm trying to do this with regex in the function below, but I lose the numeric multiplier for each string:

strTimeConverter <- function(z){
  ret <- stringi::stri_replace_all_regex(
    str = z, 
    pattern = c('^\\*', 
                '^t', 
                '([[:digit:]]{1,})mo', 
                '([[:digit:]]{1,})d', 
                '([[:digit:]]{1,})h',
                '([[:digit:]]{1,})m'),
    replacement = c('Sys.time()', 
                    'Sys.Date()', 
                    '*lubridate::months(1)', 
                    '*lubridate::days(1)', 
                    '*lubridate::hours(1)', 
                    '*lubridate::minutes(1)'),
    vectorize_all = F
  )
  return(ret)
  # return(eval(expr = parse(text = ret)))
}

> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"

> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"

Expected output:

# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"

I assumed that wrapping the [[:digit:]]{1,} in parentheses () would preserve the digits, but clearly that's not working. I defined the patterns this way because otherwise the code replaces repeated occurrences: e.g. * gets converted to Sys.time(), but then the m in Sys.time() gets replaced with *lubridate::minutes(1).

I plan on converting the (expected) output to R date-time using eval(parse(text = ...)) - currently commented out in the function.

I'm open to using other packages or approaches.

Update

After tinkering around for a bit, I found that the version below works. I'm replacing the strings in an order such that newly inserted characters are not replaced again:

strTimeConverter <- function(z){
  ret <- stringi::stri_replace_all_regex(
    str = z, 
    pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
    replacement = c('*years(1)',
                    '*days(1)', 
                    '*hours(1)', 
                    '*days(30)',
                    '*minutes(1)',
                    'Sys.Date()', 
                    'Sys.time()'),
    vectorize_all = F
  )
  ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
  rdate <- (eval(expr = parse(text = ret)))
  attr(rdate, 'tzone') <- 'UTC'
  return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)

This works but is not very elegant, and it will likely fail once I have to incorporate other expressions (e.g. yd for day of the year, such as 124).


Solution

  • You can use backreferences in the replacements like this:

    library(stringr)
    x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
    repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)',  '(\\d+)h' =  '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
    stringr::str_replace_all(x, repl)
    ## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
    ##    [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
    
    


    See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'. Here, (\d+)mo matches and captures into Group 1 one or more digits, and mo is just matched. Then, when the match is found, \1 in \1*lubridate::months(1) inserts the contents of Group 1 into the resulting string.

    Note that it might make the replacements safer if you cap the time period match with a word boundary (\b) on the right:

    repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)',  '(\\d+)h\\b' =  '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')
    

    It won't work if the time spans are glued together without any non-word delimiters (e.g. 3d4h), but you have + in your example strings, so it is safe here.
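
    To see what the word boundary changes, here is a small sketch using base R's gsub() with perl = TRUE instead of stringr, so that it runs without extra packages. That substitution is an assumption made for illustration; for these ASCII inputs stringr should behave the same way. The variable names are hypothetical:

    ```r
    # Minute pattern without and with a trailing word boundary.
    loose  <- "(\\d+)m"
    strict <- "(\\d+)m\\b"
    repl   <- "\\1*minutes(1)"

    # Without \b, the minute pattern also fires on the "m" of "mo".
    loose_on_month  <- gsub(loose,  repl, "5mo",   perl = TRUE)  # "5*minutes(1)o"

    # With \b, "5mo" is left alone because "m" is followed by a word character.
    strict_on_month <- gsub(strict, repl, "5mo",   perl = TRUE)  # "5mo"

    # A "+" after the unit is a non-word character, so the boundary matches.
    strict_on_plus  <- gsub(strict, repl, "2m+4h", perl = TRUE)  # "2*minutes(1)+4h"

    # Glued spans: "m" is followed by the digit "4", so there is no boundary
    # and nothing is replaced.
    strict_on_glued <- gsub(strict, repl, "2m4h",  perl = TRUE)  # "2m4h"
    ```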

    Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n syntax:

    x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
    pattern <- c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
    replacement <- c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
    stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)
    

    Output:

    [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
    [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
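
If you would rather not eval(parse()) a generated string, a different technique is to tokenise the expression and apply each offset directly. The following is only a sketch in base R, and it rests on several assumptions that go beyond the question: the helper name parse_time_expr and its now/tz arguments are made up for illustration, every term must carry an explicit sign, months are stepped with seq() (whose month-end handling may differ from lubridate's), and days are treated as fixed 24-hour spans.

```r
# Hypothetical helper (not from the original post): evaluates expressions such
# as "*-5mo+3d+4h+2m" without eval(parse()). `now` can be fixed for testing.
parse_time_expr <- function(expr, now = Sys.time(), tz = "UTC") {
  now <- as.POSIXct(now, tz = tz)
  attr(now, "tzone") <- tz

  # The first character selects the anchor: "*" = now, "t" = today at 00:00.
  anchor <- substr(expr, 1, 1)
  result <- switch(anchor,
    "*" = now,
    "t" = as.POSIXct(format(now, "%Y-%m-%d", tz = tz), tz = tz),
    stop("expression must start with '*' or 't': ", expr)
  )

  # Everything after the anchor must be a sequence of <sign><digits><unit>.
  rest <- substring(expr, 2)
  token_re <- "([+-])([0-9]+)(mo|d|h|m)"
  if (!grepl(paste0("^(", token_re, ")*$"), rest)) {
    stop("unsupported expression: ", expr)
  }

  tokens <- regmatches(rest, gregexpr(token_re, rest))[[1]]
  for (tok in tokens) {
    parts <- regmatches(tok, regexec(token_re, tok))[[1]]
    n <- (if (parts[2] == "-") -1 else 1) * as.integer(parts[3])
    unit <- parts[4]
    if (unit == "mo") {
      # Calendar months: let seq() step the month field.
      result <- seq(result, by = paste(n, "months"), length.out = 2)[2]
    } else {
      # Fixed-length units, expressed in seconds.
      secs <- c(d = 86400, h = 3600, m = 60)[[unit]]
      result <- result + n * secs
    }
  }
  result
}

# Usage with a fixed reference time so the results are reproducible.
ref <- as.POSIXct("2024-03-15 10:30:00", tz = "UTC")
format(parse_time_expr("*-5mo+3d+4h+2m", now = ref), "%Y-%m-%d %H:%M", tz = "UTC")
## => "2023-10-18 14:32"
format(parse_time_expr("t-3h", now = ref), "%Y-%m-%d %H:%M", tz = "UTC")
## => "2024-03-14 21:00"
```

Because the units are matched by a single alternation with mo listed before m, no replacement can be re-matched by a later pattern. Supporting another unit would mean adding one alternative to the pattern and handling it alongside the existing units.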