I'm reading in data from another platform where a combination of the strings listed below is used for expressing timestamps:
\* = current time
t = current day (00:00)
mo = month
d = days
h = hours
m = minutes
For example, *-3d
is current time minus 3 days, t-3h
is three hours before today morning (midnight yesterday).
I'd like to be able to ingest these equations into R and get the corresponding POSIXct
value. I'm trying using regex in the below function but lose the numeric multiplier for each string:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('^\\*',
'^t',
'([[:digit:]]{1,})mo',
'([[:digit:]]{1,})d',
'([[:digit:]]{1,})h',
'([[:digit:]]{1,})m'),
replacement = c('Sys.time()',
'Sys.Date()',
'*lubridate::months(1)',
'*lubridate::days(1)',
'*lubridate::hours(1)',
'*lubridate::minutes(1)'),
vectorize_all = F
)
return(ret)
# return(eval(expr = parse(text = ret)))
}
> strTimeConverter('*-5mo+3d+4h+2m')
[1] "Sys.time()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
> strTimeConverter('t-5mo+3d+4h+2m')
[1] "Sys.Date()-*lubridate::months(1)+*lubridate::days(1)+*lubridate::hours(1)+*lubridate::minutes(1)"
Expected output:
# *-5mo+3d+4h+2m
"Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+4*lubridate::minutes(1)"
# t-5mo+3d+4h+2m
"Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+4*lubridate::minutes(1)"
I assumed that wrapping the [[:digit]]{1,}
in parentheses ()
would preserve them but clearly that's not working. I defined the pattern like this else the code replaces repeat occurrences e.g. *
gets converted to Sys.time()
but then the m
in Sys.time()
gets replaced with *lubridate::minutes(1)
.
I plan on converting the (expected) output to R date-time using eval(parse(text = ...))
- currently commented out in the function.
I'm open to using other packages or approach.
Update
After tinkering around for a bit, I found the below version works - I'm replacing strings in the order such that newly replaced characters are not replaced again:
strTimeConverter <- function(z){
ret <- stringi::stri_replace_all_regex(
str = z,
pattern = c('y', 'd', 'h', 'mo', 'm', '^t', '^\\*'),
replacement = c('*years(1)',
'*days(1)',
'*hours(1)',
'*days(30)',
'*minutes(1)',
'Sys.Date()',
'Sys.time()'),
vectorize_all = F
)
ret <- gsub(pattern = '\\*', replacement = '*lubridate::', x = ret)
rdate <- (eval(expr = parse(text = ret)))
attr(rdate, 'tzone') <- 'UTC'
return(rdate)
}
sample_string <- '*-5mo+3d+4h+2m'
strTimeConverter(sample_string)
This works but is not very elegant and will likely fail as I'm forced to incorporate other expressions (e.g. yd
for day of the year e.g. 124).
You can use backreferences in the replacements like this:
library(stringr)
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo' = '\\1*lubridate::months(1)', '(\\d+)d' = '\\1*lubridate::days(1)', '(\\d+)h' = '\\1*lubridate::hours(1)', '(\\d+)m' = '\\1*lubridate::minutes(1)')
stringr::str_replace_all(x, repl)
## => [1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
## [2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
See the R demo online.
See, for example, '(\\d+)mo' = '\\1*lubridate::months(1)'
. Here, (\d+)mo
matches and captures into Group 1 one or more digits, and mo
is just matched. Then, when the match is found, \1
in \1*lubridate::months(1)
inserts the contents of Group 1 into the resulting string.
Note that it might make the replacements safer if you cap the time period match with a word boundary (\b
) on the right:
repl <- c('^\\*' = 'Sys.time()', '^t' = 'Sys.Date()', '(\\d+)mo\\b' = '\\1*lubridate::months(1)', '(\\d+)d\\b' = '\\1*lubridate::days(1)', '(\\d+)h\\b' = '\\1*lubridate::hours(1)', '(\\d+)m\\b' = '\\1*lubridate::minutes(1)')
It won't work if the time spans are glued one to another without any non-word delimiters, but you have +
in your example strings, so it is safe here.
Actually, you can make it work with the function you used, too. Just make sure the backreferences have the $n
syntax:
x <- c("*-5mo+3d+4h+2m", "t-5mo+3d+4h+2m")
pattern = c('^\\*', '^t', '(\\d+)mo', '(\\d+)d', '(\\d+)h', '(\\d+)m')
replacement = c('Sys.time()', 'Sys.Date()', '$1*lubridate::months(1)', '$1*lubridate::days(1)', '$1*lubridate::hours(1)', '$1*lubridate::minutes(1)')
stringi::stri_replace_all_regex(x, pattern, replacement, vectorize_all=FALSE)
Output:
[1] "Sys.time()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"
[2] "Sys.Date()-5*lubridate::months(1)+3*lubridate::days(1)+4*lubridate::hours(1)+2*lubridate::minutes(1)"