I have a column of roughly 1300 characters which I need to extract a single date from, if the character contains a date (i.e. if NA then no date to be taken) and if it contains multiple dates, I only need one; if it contains an interval of dates, I don't need the dates.
For example, here are 10 observations that best exemplify the different cases with what I'd like to get as comments to the side:
string1 <- 'Eff. 1/1/96 ACWD Res #96-006 Service' # need the date
string2 <- 'NA' # irrelevant
string3 <-'Effective 2/1/07' # need the date
string4 <- 'Effective: 3/01/2011' # need the date
string5 <- 'Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch' # need the date
string6 <- 'Effective: 2010-11' # need the date
string7 <- 'Eff. January 02' # need the date
string8 <- 'Effective 1/1/09 Billing (svc prd 10/15 - 12/15/08)' # need first date only, not intervals
string9 <- 'Eff. 9/1/95 Resolution No. 63-95 1st 1000 g. free' # need the date
string10 <- '(svc prd 10/15-12/15/08)' # don't need interval dates
So strings 1 & 3 & 8 & 9
(same formats), string4
, string5
, string6
, & string7
have different date formats. Further, string6
, & string7
have more issues. string6
can be taken as 1/1/10
(generally as 1/1/FIRST YEAR), whereas string7
has a year which can be identified by another character column, named FY
containing values such as FY 9596
, then string7
can be taken as 1/2/95
.
Desired output for the 10 strings should be:
(they could all also be YYYY-MM-DD
, that doesn't matter much so long as they are consistent)
1/1/96
NULL
2/1/07
3/1/11
7/1/95
1/1/10
1/2/95
1/1/09
9/1/95
NULL
When I test it on the 10 all at once, using the following
for(j in 1:10){
strapplyc(string[j], "\\d+/\\d+/\\d+", simplify = TRUE)
}
I get the following due to the structural differences in the instances date formats:
Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed
In particular, string5
,string6
,string7
fails to return what I need, as expected I get NULL
instead; Further, string8
fails to return what I need, as I get
[,1]
[1,] "1/1/09"
[2,] "12/15/08"
Lastly, string10
fails to return what I need, I instead get 12/15/08
.
Is the most efficient approach to mutate
with ifelse
for string5
,string6
,string7
??; for string10
I thought to assign NULL if a date is preceded by -
as I thought that is likely signaling an interval which is irrelevant for my purpose but string6
contains a hyphen, which I need.
To the best of my knowledge I see somewhat related posts here here, & here. But thought this case was sufficiently different. Apologies in advance if that is not the case.
Any help is very much appreciated !!
Based off of @mnist's comment and a recognized pattern in my subsequent comment, I split the data (let myData
denote my data frame and String
denote the column of all 1300
string observations) with grepl
myData <- myData %>% filter(grepl("Eff|eff|Ef",String))
Then I again split myData
into 2 subsets, with Case 1 (nice case) corresponding to filter(grepl("\\d+/\\d+/\\d+", String))
and Case 2 corresponding to filter(!grepl("\\d+/\\d+/\\d+", String))
respectively. As it turns out, Case 2 (annoying case) only amounts to 3% of the observations (<50 obs) which I suppose I will deal with manually since it is not much.
Turns out Case 1 only had one observation like string8
so I corrected that manually.