Search code examples
rdatetimelubridateposixct

Find the earliest and latest date within each row in R


I have large data set with over 400 columns which represent certain user input elements from an online platform and the time when each input occured. Each row represent a user ID.

200 of those columns are of class "POSIXct" "POSIXt" (e.g. 2019-11-04 15:33:50), and missing values can occure, as not every element is displayed to or filled in by every user.

My goal is to create two additional columns that include the earliest and the latest date per row of each of those 200 "POSIXct" "POSIXt" columns.

Here a simplified example of the frame and one of the desired additional columns. (ID 4 would be someone that never bothered to open the side, but has data from other data sources available, and should remain in the dataset for now)

ID Other_columns    date_column          date_column2          date_column3             max_date (what I want)
1  "numeric"        2019-11-04 19:33:50  2019-11-05 15:33:50   2019-11-05 16:33:50      2019-11-05 16:33:50
2  "numeric"        NA                   2019-11-04 17:20:10   2019-11-09 19:12:50      2019-11-09 19:12:50
3  "numeric"        2019-11-07 20:33:50  NA                    2019-11-04 18:31:50      2019-11-07 20:33:50
4  NA               NA                   NA                    NA                       NA

So far I did not really come further that filtering out the other non-date columns,

is.POSIXt <- function(x) inherits(x, "POSIXt")      
df%>%select(where(is.POSIXt))

Instead of the select I probably should use a mutate_at or something as condition, but what is the best way to check all of those remaining 200 date/time columns and then assign the earliest/latest date to the newly created columns (while ignoring the NA values).


Solution

  • We can use pmax and pmin on the 'date' columns to return the earliest and latest date for each row

    library(dplyr)
     df %>%
         mutate(max_date = do.call(pmax, c(select(., starts_with('date')), na.rm = TRUE)),
                min_date = do.call(pmin, c(select(., starts_with('date')), 
             na.rm = TRUE)))
    #  ID Other_columns         date_column        date_column2        date_column3            max_date            min_date
    #1  1       numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
    #2  2       numeric                <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
    #3  3       numeric 2019-11-07 20:33:50                <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
    #4  4          <NA>                <NA>                <NA>                <NA>                <NA>                <NA>
    

    Or another option with rowwise with c_across

    df %>% 
       rowwise() %>% 
       mutate(max_date =  max(as.POSIXct(c_across(starts_with('date'))), 
             na.rm = TRUE),
              min_date = min(as.POSIXct(c_across(starts_with('date'))), 
             na.rm = TRUE))
    

    -output

    # A tibble: 4 x 7
    # Rowwise: 
    #     ID Other_columns date_column         date_column2        date_column3        max_date            min_date           
    #  <int> <chr>         <chr>               <chr>               <chr>               <dttm>              <dttm>             
    #1     1 numeric       2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
    #2     2 numeric       <NA>                2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
    #3     3 numeric       2019-11-07 20:33:50 <NA>                2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
    #4     4 <NA>          <NA>                <NA>                <NA>                NA NA               NA NA        
    

    data

    df <- structure(list(ID = 1:4, Other_columns = c("numeric", "numeric", 
    "numeric", NA), date_column = c("2019-11-04 19:33:50", NA, "2019-11-07 20:33:50", 
    NA), date_column2 = c("2019-11-05 15:33:50", "2019-11-04 17:20:10", 
    NA, NA), date_column3 = c("2019-11-05 16:33:50", "2019-11-09 19:12:50", 
    "2019-11-04 18:31:50", NA)), class = "data.frame", row.names = c(NA, 
    -4L))