Search code examples
rdistinct-values

R: dataframe Selecting the maximum row per ID based on first timestamp


I have a data frame which contains records that have time stamps. The toy example below contains an ID with 2 SMS's attached to it based on two different time stamps. In reality there would be thousands of IDs each with almost 80-100 SMS Types and dates

toydf <- data.frame(ID = c(1045937900, 1045937900), 
                    SMS.Type = c("DF1", "WCB14"), 
                    SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"))

I want to be able to create a new dataframe that only contains the the record of the SMS type for the first SMS.Date or even the last

I have had a look at using duplicated, I have also thought about sorting the date column in descending order per ID and adding a new column which puts a 1 next to the first instance of the ID and a zero if the current ID is equal to the previous ID. I suspect this will get heavy if the number of records increases dramatically

Does anyone know a more elegant way of doing this - maybe using data.table

Thanks for your time


Solution

  • Try

    library(dplyr)
    toydf %>% 
       group_by(ID) %>%
       arrange(desc(as.POSIXct(SMS.Date, format='%d/%m/%Y %H:%M'))) %>% 
       slice(1L)
    

    Or using data.table

    library(data.table)
    toydf$SMS.Date <- as.POSIXct(toydf$SMS.Date, format='%d/%m/%Y %H:%M')
    setkey(setDT(toydf), ID, SMS.Date)[, .SD[.N], ID]