Search code examples
rlistdate

Guess dates over list of dfs using `anytime`


Mock data:

my_list <- list(structure(list(start = c("2023-09-12T18:59:49.336+01:00", 
                                     "2023-09-18 07:29:34", "z2023-09-18 07:29:34"), country = c("USA", 
                                                                                                 "UK", "UK")), class = "data.frame", row.names = c(NA, -3L)), 
            structure(list(start = c("2023-09-16T21:12:51.288Z", "2019-11-30", 
                                     "2019-30-11"), country = c("UK", "Canada", "France")), class = "data.frame", row.names = c(NA, 
                                                                                                                                -3L)))

Using the second answer to this question (please note, the first answer does not work for me), I want to convert all dates to year-month-day.

I do:

library(anytime)
lapply(my_list, transform,
       start = anydate(start)
)

Output:

[[1]]
       start country
1 2023-09-12     USA
2 2023-09-18      UK
3       <NA>      UK

[[2]]
       start country
1 2023-09-16      UK
2 2019-11-30  Canada
3       <NA>  France

As one can see, it does not work for two dates, which are coerced to NA. Note, my real data has dozens of dfs and thousands of dates, with no unique format.

How do we do this? Desired output:

[[1]]
       start country
1 2023-09-12     USA
2 2023-09-18      UK
3 2023-09-18      UK

[[2]]
       start country
1 2023-09-16      UK
2 2019-11-30  Canada
3 2019-11-30  France

UPDATE

  1. Question was updated following Andre Wildberg's answer
  2. I need to return all columns contained in the dfs, not just date
  3. I ideally wish to use transform, and to be able to do any modification on other variables (i.e., not only dates, see example below) if needed
  4. Solution does not need to use anytime, though

(Example to illustrate Update point 3:)

lapply(my_list, transform,
       start = anydate(start),
       country = as.factor(country) # etc.
)

Further, final format need to be a Date and Factor, respectively:

List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ start  : Date[1:3], format: "2023-09-12" "2023-09-18" NA
  ..$ country: Factor w/ 2 levels "UK","USA": 2 1 1
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ start  : Date[1:3], format: "2023-09-16" "2019-11-30" NA
  ..$ country: Factor w/ 3 levels "Canada","France",..: 3 1 2

Solution

  • An approach with parse_date_time. Using sub to pull out the date part of the string.

    EDIT: with updated data

    library(lubridate)
    
    lapply(my_list, \(x) 
      setNames(data.frame(
          parse_date_time(
            sub(".*(\\d{4}-\\d+-\\d+).*", "\\1", x[,"start"]), 
            orders=c("Ymd", "Ydm")), 
          x[,-1]), # everything but the first "start" column
        names(x)))
    [[1]]
           start country
    1 2023-09-12     USA
    2 2023-09-18      UK
    3 2023-09-18      UK
    
    [[2]]
           start country
    1 2023-09-16      UK
    2 2019-11-30  Canada
    3 2019-11-30  France
    

    Keep in mind that ambivalent dates like 2023-08-09 <> 2023-09-08 are interpreted through the first rule Ymd

    Using transform

    lapply(my_list, \(x) 
      transform(x, 
        start=parse_date_time(sub(".*(\\d{4}-\\d+-\\d+).*", "\\1", start), 
                orders=c("Ymd", "Ydm")), 
        country=as.factor(country)))