
Extract date from given string in R


string<-c("Posted 69 months ago (7/4/2011)")
library(gsubfn)
strapplyc(string, "(.*)", simplify = TRUE)

I applied the function above, but it does not extract the date.

I want to extract only the date part, i.e. 7/4/2011.


Solution

  • The first solution shows how to fix the code in the question to give the desired answer. The next two are the same except that they use different regular expressions. The fourth shows how to do it with gsub, the fifth breaks the gsub into two sub calls, the sixth uses read.table and the seventh uses trimws.

    1) Escape parens. The problem is that ( and ) have special meaning in regular expressions, so you must escape them if you want to match them literally. Writing them as "[(]" and "[)]", as we do below (or as "\\(" and "\\)"), matches them literally. The inner parentheses define the capture group, since we don't want that group to include the literal parentheses themselves:

    strapplyc(string, "[(](.*)[)]", simplify = TRUE)
    ## [1] "7/4/2011"
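
    The "\\(" spelling mentioned above can also be sketched in base R, without gsubfn. This is only an illustration, assuming the string contains a single parenthesised group; regexec and regmatches are base R functions and are swapped in here for strapplyc:

    ```r
    string <- "Posted 69 months ago (7/4/2011)"

    # regexec() records the position of the full match and of each capture group
    m <- regmatches(string, regexec("\\((.*)\\)", string))

    # element 1 is the full match including the parentheses,
    # element 2 is the text captured by the inner group
    date_part <- m[[1]][2]
    date_part
    ```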
    

    2) Match content. Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:

    strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
    ## [1] "7/4/2011"
    

    You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
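
    If you did want to pin down the digit counts, a minimal sketch might look like the following. The field widths (one or two digits for each of the first two fields, four for the year) are an assumption about the data, and base R's regexpr and regmatches are used here in place of strapplyc:

    ```r
    string <- "Posted 69 months ago (7/4/2011)"

    # assumed layout: 1-2 digits / 1-2 digits / 4-digit year
    pattern <- "\\d{1,2}/\\d{1,2}/\\d{4}"

    # regexpr() finds the first match; regmatches() extracts its text
    strict_date <- regmatches(string, regexpr(pattern, string))
    strict_date
    ```

    The "69" earlier in the string is not picked up because it is not followed by a slash.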

    3) Match 8 or more digits and slashes. Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string, we could just pick that sequence out:

    strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
    ## [1] "7/4/2011"
    

    4) Remove text before and after. Another way of doing it is to remove everything up to the ( and everything from the ) onwards, like this:

    gsub(".*[(]|[)].*", "", string)
    ## [1] "7/4/2011"
    

    5) sub. This is the same as (4) except that it breaks the gsub into two sub invocations, one removing everything up to the ( and the other removing the ) onwards. The regular expressions are therefore slightly simpler.

    sub(".*\\(", "", sub("\\).*", "", string))
    ## [1] "7/4/2011"
    

    6) read.table. This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.

    read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
    ## [1] "7/4/2011"
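
    Since read.table reads each element of text as a separate line, the same call should also handle several strings at once. The sketch below assumes every string has exactly one ( and one ); the first element of strings is a made-up example, not data from the question:

    ```r
    # hypothetical input: the first element is invented for illustration
    strings <- c("Posted 63 months ago (1/15/2012)",
                 "Posted 69 months ago (7/4/2011)")

    # "(" splits each line into two fields; ")" starts a comment, so it and
    # anything after it are dropped before the fields are split
    dates <- read.table(text = strings, sep = "(", comment.char = ")",
                        as.is = TRUE)$V2
    dates
    ```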
    

    7) trimws. This trims everything on either end up to the ( or ) and then trims the ( and ) themselves.

    string |>
      trimws(whitespace = "[^()]") |>
      trimws(whitespace = "[()]")
    ## [1] "7/4/2011"
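
    The native pipe |> needs R 4.1.0 or later, while the whitespace argument of trimws needs R 3.6.0 or later. On a version in between, the same two steps could be written without the pipe, roughly as in this sketch (the intermediate variable name is only for illustration):

    ```r
    string <- "Posted 69 months ago (7/4/2011)"

    # step 1: strip every leading and trailing character that is not a parenthesis
    outer_trimmed <- trimws(string, whitespace = "[^()]")

    # step 2: strip the parentheses themselves
    date_part <- trimws(outer_trimmed, whitespace = "[()]")
    date_part
    ```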
    

    Note: You don't need the c when defining string, since a single string is already a character vector of length one:

    string <- c("Posted 69 months ago (7/4/2011)")
    string2 <- "Posted 69 months ago (7/4/2011)"
    identical(string, string2)
    ## [1] TRUE