Search code examples
rregexdplyrgreplgsubfn

Extracting dates in R, from a string variable with different date formats exhibiting lack of general structure / difficult pattern


I have a column of roughly 1300 characters which I need to extract a single date from, if the character contains a date (i.e. if NA then no date to be taken) and if it contains multiple dates, I only need one; if it contains an interval of dates, I don't need the dates.

For example, here are 10 observations that best exemplify the different cases with what I'd like to get as comments to the side:

string1 <- 'Eff. 1/1/96 ACWD Res #96-006 Service' # need the date
string2 <- 'NA' # irrelevant 
string3 <-'Effective 2/1/07' # need the date
string4 <- 'Effective: 3/01/2011' # need the date
string5 <- 'Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch' # need the date
string6 <- 'Effective: 2010-11' # need the date
string7 <- 'Eff. January 02' # need the date 
string8 <- 'Effective 1/1/09 Billing (svc prd 10/15 - 12/15/08)' # need first date only, not intervals
string9 <- 'Eff. 9/1/95 Resolution No. 63-95 1st 1000 g. free' # need the date
string10 <- '(svc prd 10/15-12/15/08)' # don't need interval dates

So strings 1 & 3 & 8 & 9 (same formats), string4, string5, string6, & string7 have different date formats. Further, string6, & string7 have more issues. string6 can be taken as 1/1/10(generally as 1/1/FIRST YEAR), whereas string7 has a year which can be identified by another character column, named FY containing values such as FY 9596, then string7 can be taken as 1/2/95.

Desired output for the 10 strings should be: (they could all also be YYYY-MM-DD, that doesn't matter much so long as they are consistent)

1/1/96
NULL
2/1/07
3/1/11
7/1/95
1/1/10
1/2/95
1/1/09
9/1/95
NULL

When I test it on the 10 all at once, using the following

for(j in 1:10){
strapplyc(string[j], "\\d+/\\d+/\\d+", simplify = TRUE)
}

I get the following due to the structural differences in the instances date formats:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

In particular, string5,string6,string7 fails to return what I need, as expected I get NULL instead; Further, string8 fails to return what I need, as I get

      [,1]      
[1,] "1/1/09"  
[2,] "12/15/08"

Lastly, string10 fails to return what I need, I instead get 12/15/08.

Is the most efficient approach to mutate with ifelse for string5,string6,string7??; for string10 I thought to assign NULL if a date is preceded by - as I thought that is likely signaling an interval which is irrelevant for my purpose but string6 contains a hyphen, which I need.

To the best of my knowledge I see somewhat related posts here here, & here. But thought this case was sufficiently different. Apologies in advance if that is not the case.

Any help is very much appreciated !!


Solution

  • Based off of @mnist's comment and a recognized pattern in my subsequent comment, I split the data (let myData denote my data frame and String denote the column of all 1300 string observations) with grepl

    myData <- myData %>% filter(grepl("Eff|eff|Ef",String))
    

    Then I again split myData into 2 subsets, with Case 1 (nice case) corresponding to filter(grepl("\\d+/\\d+/\\d+", String)) and Case 2 corresponding to filter(!grepl("\\d+/\\d+/\\d+", String)) respectively. As it turns out, Case 2 (annoying case) only amounts to 3% of the observations (<50 obs) which I suppose I will deal with manually since it is not much.

    Turns out Case 1 only had one observation like string8 so I corrected that manually.