Search code examples
rdatetimetimestrftimedatetime-conversion

Conversion of large dataframe column of string to datetime - unexplainable behavior


I have a file containing date and time information as a character string. I want to convert it to a datetime POSIXlt object and to that end I use the strftime() base function in R.

While the command works fine when I am passing it a character string, or when I apply it a small slice of the dataframe, when I apply it in a larger slice or in the entire datadrame it fails to capture the Hour. Please see below:

d = "2017-11-18 01:00:00"
t = strftime(d, format = "%Y-%m-%d %H:%M")
t
'2017-11-18 01:00'

head(data %>% dplyr::slice(1:1000) %>% mutate(DateTime1 = strftime(DateTime, format = "%Y-%m-%d %H:%M")))
DateTime    Junction    Vehicles    ID  DateTime1
2015-11-01 00:00:00 1   15  20151101001 2015-11-01 00:00
2015-11-01 01:00:00 1   13  20151101011 2015-11-01 01:00
2015-11-01 02:00:00 1   10  20151101021 2015-11-01 02:00
2015-11-01 03:00:00 1   7   20151101031 2015-11-01 03:00
2015-11-01 04:00:00 1   9   20151101041 2015-11-01 04:00
2015-11-01 05:00:00 1   6   20151101051 2015-11-01 05:00

head(data %>% mutate(DateTime1 = strftime(DateTime, format = "%Y-%m-%d %H:%M")))
DateTime    Junction    Vehicles    ID  DateTime1
2015-11-01 00:00:00 1   15  20151101001 2015-11-01 00:00
2015-11-01 01:00:00 1   13  20151101011 2015-11-01 00:00
2015-11-01 02:00:00 1   10  20151101021 2015-11-01 00:00
2015-11-01 03:00:00 1   7   20151101031 2015-11-01 00:00
2015-11-01 04:00:00 1   9   20151101041 2015-11-01 00:00
2015-11-01 05:00:00 1   6   20151101051 2015-11-01 00:00

How this erratic behavior is to be explained and how can I convert the datetime column for the entire dataset?

Your advice will be appreciated.


Solution

  • I can't tell exactly, but there might be some inconsistency in your DateTime variable. Since it seems to be a character variable, when the vectorized version of your conversion finds an unexpected behaviour it won't handle it well because the same conversion is applied to every entry (it'll use a format that works in all cases), that is probably the reason why it worked in a slice. I'd suggest either gradually increasing your slice size in order to track where the inconsistency appears (there might be a smarter way to check it).

    Another note is that strftime requires its first argument to be convertible to POSIXlt, so I'd recommend passing a POSIXlt object to avoid your function choosing how to convert your original string. You could use for example:

    t = strftime(strptime(d, format = "%Y-%m-%d %H:%M:%OS"), format = "%Y-%m-%d %H:%M")
    

    Then, hopefully, when converting your character variable to a POSIXlt the way you expect it to be, via strptime, it will throw an error due to inconsistent rows.

    A third, but not as good solution would be to use rowwise(), what would convert each row separately, producing expected results except for the inconsistencies (It'd look for appropriate format in each row, instead of one format that works for everyone).