I wrote an R script several months ago and part of that script converts character dates into date format.
I initially encountered this problem where NA
s were being introduced when I converted characters to date format.
It was suggested that the reason this happened was because it must be expecting the day
element of the date to be two characters, - such as June 12th 2018
- and only failed when the day
element contained a single character - such as June 2nd 2018
.
The proffered solution (as.Date(df$date, format='%B %d %Y')
) worked perfectly.
Until now.
Not only do I get NA
values, I also receive the error: Error: Duplicate identifiers for rows (12, 14), (13, 16)
.
I don't know what this means - can somebody explain?
This is the original data frame:
time.per.day Top.0.type Count
1 July 27th 2018, 00:00:00.000 conversation-archived 2
2 July 27th 2018, 00:00:00.000 conversation-archived 1
3 July 28th 2018, 00:00:00.000 conversation-archived 4
4 July 28th 2018, 00:00:00.000 conversation-archived 1
5 July 29th 2018, 00:00:00.000 conversation-archived 2
6 July 29th 2018, 00:00:00.000 conversation-archived 2
7 July 29th 2018, 00:00:00.000 conversation-auto-archived 2
8 July 30th 2018, 00:00:00.000 conversation-archived 3
9 July 30th 2018, 00:00:00.000 conversation-archived 2
10 July 30th 2018, 00:00:00.000 conversation-auto-archived 1
11 July 31st 2018, 00:00:00.000 conversation-archived 1
12 August 1st 2018, 00:00:00.000 conversation-archived 1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived 1
14 August 2nd 2018, 00:00:00.000 conversation-archived 4
15 August 2nd 2018, 00:00:00.000 conversation-archived 1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived 2
This is the raw data:
df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000",
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000",
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000",
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived",
"conversation-archived", "conversation-archived", "conversation-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L,
1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L
))
I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")
) and manipulate the data to look a certain way, but now after using as.Date(df$date, format='%B %d %Y')
, do some other maintenance:
# Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)
this is the resulting data frame:
date type count
1 2018-07-27 Completed 2
2 2018-07-27 Completed 1
3 2018-07-28 Completed 4
4 2018-07-28 Completed 1
5 2018-07-29 Completed 2
6 2018-07-29 Completed 2
7 2018-07-29 Missed 2
8 2018-07-30 Completed 3
9 2018-07-30 Completed 2
10 2018-07-30 Missed 1
11 2018-07-31 Completed 1
12 <NA> Completed 1
13 <NA> Missed 1
14 <NA> Completed 4
15 <NA> Completed 1
16 <NA> Missed 2
Here is a dput
of the the resulting data frame:
df <- structure(list(date = structure(c(17739, 17739, 17740, 17740,
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA,
NA, NA), class = "Date"), type = c("Completed", "Completed",
"Completed", "Completed", "Completed", "Completed", "Missed",
"Completed", "Completed", "Missed", "Completed", "Completed",
"Missed", "Completed", "Completed", "Missed"), count = c(2L,
1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-16L))
Why is this going wrong now?
It was brought to my attention that df$date <- gsub("st", "", df$date)
was converting August
to Augu
and as such this was causing the NA values to appear.
I changed it to df$date <- gsub("1st", "", df$date)
, but this now causes other problems with the resulting data frame (rows 12-16 (inclusive)):
date type count
1 2018-07-27 Completed 2
2 2018-07-27 Completed 1
3 2018-07-28 Completed 4
4 2018-07-28 Completed 1
5 2018-07-29 Completed 2
6 2018-07-29 Completed 2
7 2018-07-29 Missed 2
8 2018-07-30 Completed 3
9 2018-07-30 Completed 2
10 2018-07-30 Missed 1
11 2018-07-03 Completed 1
12 0018-08-20 Completed 1
13 0018-08-20 Missed 1
14 0018-08-20 Completed 4
15 0018-08-20 Completed 1
16 0018-08-20 Missed 2
How can this be fixed?
Originally,
df$date <- gsub("st", "", df$date)
was causing an issue as it was matching "st" of "August" as well as "1st". To overcome that we just need to replace "1st" with "1" as we need the date.
df$date <- gsub("1st", "1", df$date)
and then convert to date.
as.Date(df$date, "%B %d %Y")
#[1] "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7] "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
Ideally, hardcoding and replacing values is not a good idea which can cause such issues, hence, we can instead replace the values when a digit is followed by ordinal in one step instead of 4 separate sub
s.
So after
df$date <- sub(", 00:00:00.000", "", df$date)
we can directly do,
df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)