Search code examples
rdatedate-conversion

Why doesn't my date conversion solution work any more, despite nothing changing?


I wrote an R script several months ago and part of that script converts character dates into date format.

I initially encountered this problem where NAs were being introduced when I converted characters to date format.

It was suggested that the reason this happened was because it must be expecting the day element of the date to be two characters, - such as June 12th 2018 - and only failed when the day element contained a single character - such as June 2nd 2018.

The proffered solution (as.Date(df$date, format='%B %d %Y')) worked perfectly.

Until now.

Not only do I get NA values, I also receive the error: Error: Duplicate identifiers for rows (12, 14), (13, 16).

I don't know what this means - can somebody explain?

This is the original data frame:

                    time.per.day                 Top.0.type Count
1   July 27th 2018, 00:00:00.000      conversation-archived     2
2   July 27th 2018, 00:00:00.000      conversation-archived     1
3   July 28th 2018, 00:00:00.000      conversation-archived     4
4   July 28th 2018, 00:00:00.000      conversation-archived     1
5   July 29th 2018, 00:00:00.000      conversation-archived     2
6   July 29th 2018, 00:00:00.000      conversation-archived     2
7   July 29th 2018, 00:00:00.000 conversation-auto-archived     2
8   July 30th 2018, 00:00:00.000      conversation-archived     3
9   July 30th 2018, 00:00:00.000      conversation-archived     2
10  July 30th 2018, 00:00:00.000 conversation-auto-archived     1
11  July 31st 2018, 00:00:00.000      conversation-archived     1
12 August 1st 2018, 00:00:00.000      conversation-archived     1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived     1
14 August 2nd 2018, 00:00:00.000      conversation-archived     4
15 August 2nd 2018, 00:00:00.000      conversation-archived     1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived     2

This is the raw data:

df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000", 
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000", 
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", 
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", 
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000", 
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000", 
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000", 
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000", 
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived", 
"conversation-archived", "conversation-archived", "conversation-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived", 
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 
1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L
))

I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")) and manipulate the data to look a certain way, but now after using as.Date(df$date, format='%B %d %Y'), do some other maintenance:

 # Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)

this is the resulting data frame:

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-31 Completed     1
12       <NA> Completed     1
13       <NA>    Missed     1
14       <NA> Completed     4
15       <NA> Completed     1
16       <NA>    Missed     2

Here is a dput of the the resulting data frame:

df <- structure(list(date = structure(c(17739, 17739, 17740, 17740, 
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA, 
NA, NA), class = "Date"), type = c("Completed", "Completed", 
"Completed", "Completed", "Completed", "Completed", "Missed", 
"Completed", "Completed", "Missed", "Completed", "Completed", 
"Missed", "Completed", "Completed", "Missed"), count = c(2L, 
1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
-16L))

Why is this going wrong now?


It was brought to my attention that df$date <- gsub("st", "", df$date) was converting August to Augu and as such this was causing the NA values to appear.

I changed it to df$date <- gsub("1st", "", df$date), but this now causes other problems with the resulting data frame (rows 12-16 (inclusive)):

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-03 Completed     1
12 0018-08-20 Completed     1
13 0018-08-20    Missed     1
14 0018-08-20 Completed     4
15 0018-08-20 Completed     1
16 0018-08-20    Missed     2

How can this be fixed?


Solution

  • Originally,

    df$date <- gsub("st", "", df$date)
    

    was causing an issue as it was matching "st" of "August" as well as "1st". To overcome that we just need to replace "1st" with "1" as we need the date.

    df$date <- gsub("1st", "1", df$date)
    

    and then convert to date.

    as.Date(df$date, "%B %d %Y")
    
    #[1]  "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
    #[7]  "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
    #[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
    

    Ideally, hardcoding and replacing values is not a good idea which can cause such issues, hence, we can instead replace the values when a digit is followed by ordinal in one step instead of 4 separate subs.

    So after

    df$date <- sub(", 00:00:00.000", "", df$date)
    

    we can directly do,

    df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)