I'm trying to make the following vectorized manipulation of date column in my data. I found a very inelegant solution and am sure there is a cleaner tidy solution. Toy example:
index <- c(1,2)
input <- c('11-9-2019', '11/01/2019-01/31/2020')
output <- c('11-9-2019', '11-01-2019')
df_in <- data.frame('index'=index, 'data'=input)
df_out <- data.frame('index' =index, 'data'=output)
I can solve the problem using sapply as follows:
df_out$data <- sapply(range(1:2), function(x) ifelse(str_length(df_in$data[x]) > 12,
str_sub(df_in$data[x], -10, -1),
df_in$data[x]))
df_out$data <- str_replace_all(df_out$data, '/', '-')
df_out$data
Is there any way to do this a) with one vectorized line, b) without relying on string indices like I did in str_sub?
Thanks!
You can do it using gsub
:
gsub("(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}).*","\\1-\\2-\\3",df_in$data)
[1] "11-9-2019" "11-01-2019"
Explanation if you're not familiar with regex:
This searches for a string that has one or two digits ((\\d{1,2})
), followed by a slash or a dash ([/-]
), then one or two more digits, again a dash or a slash, and then four digits. It replaces these with just the three sets of digits separated by dashes, and removing anything that follows this first string.