Tags: r, string, dplyr, substring, sapply

Vectorized conditional string manipulation


I'm trying to apply the following vectorized manipulation to a date column in my data. I found a very inelegant solution and am sure there is a cleaner, tidier one. Toy example:

index <- c(1,2)
input <- c('11-9-2019', '11/01/2019-01/31/2020')
output <- c('11-9-2019', '11-01-2019')

df_in <- data.frame('index'=index, 'data'=input)

df_out <- data.frame('index' =index, 'data'=output)

I can solve the problem using sapply as follows:

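library(stringr)

# For date ranges (longer than 12 characters), keep the first 10 characters, i.e. the first date
df_out$data <- sapply(seq_along(df_in$data), function(x) ifelse(str_length(df_in$data[x]) > 12, 
                                          str_sub(df_in$data[x], 1, 10), 
                                                  df_in$data[x]))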
df_out$data <- str_replace_all(df_out$data, '/', '-')
df_out$data

Is there any way to do this a) with one vectorized line, b) without relying on string indices like I did in str_sub?

Thanks!


Solution

  • You can do it using gsub:

     gsub("(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}).*","\\1-\\2-\\3",df_in$data)
     [1] "11-9-2019"  "11-01-2019"
    

    Explanation if you're not familiar with regex:

    The pattern matches one or two digits (the capture group (\\d{1,2})), followed by a slash or a dash ([/-]), then one or two more digits, another slash or dash, and then four digits ((\\d{4})). The trailing .* matches whatever follows that first date. The replacement \\1-\\2-\\3 writes back the three captured groups of digits joined by dashes, so everything after the first date is dropped.
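
To see what the capture groups pick up, here is a minimal base R sketch on the toy data from the question. It assumes the date always sits at the start of the string, so the pattern is anchored with ^ and $, and it uses sub() rather than gsub() because each string only needs one replacement. The names date_pattern and parts are made up for this illustration.

```r
# Toy data from the question
df_in <- data.frame(index = c(1, 2),
                    data  = c("11-9-2019", "11/01/2019-01/31/2020"),
                    stringsAsFactors = FALSE)

# Same pattern as in the answer, anchored to the whole string (an assumption:
# the first date starts at the beginning of each value)
date_pattern <- "^(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}).*$"

# Inspect the capture groups: each list element holds the full match
# followed by the three captured pieces (month, day, year)
parts <- regmatches(df_in$data, regexec(date_pattern, df_in$data))
parts[[2]]

# Rebuild the first date with dashes; one vectorized call, no string indices
df_out <- df_in
df_out$data <- sub(date_pattern, "\\1-\\2-\\3", df_in$data)
df_out$data
```

Because of the anchors, a value that does not start with a date is returned unchanged, which makes unexpected input easy to spot afterwards.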