Search code examples
rdplyrtidyrplyr

Is there a way to filter a row in R based on a categorical column and a date column


I have a data set in R that looks something link this:

ClientID Category Date
Person1 CategoryA 2020-09-01
Person1 CategoryA 2020-09-30
Person2 CategoryA 2020-07-25
Person2 CategoryA 2020-08-31
Person1 CategoryB 2020-03-15
Person1 CategoryB 2020-09-14
Person2 CategoryB 2020-06-17
Person2 CategoryB 2020-10-10

What I would like to do is filter it to only return the row of the most recent date in each category for every client, so it would look like this:

ClientID Category Date
Person1 CategoryA 2020-09-30
Person1 CategoryB 2020-09-14
Person2 CategoryA 2020-08-31
Person2 CategoryB 2020-10-10

I have tried

library(plyr)

dataSet %>% filter(Category == "CategoryA",Date == max(Date))

I knew this wouldn't work the moment I typed it, but I am lost on where to go. I have considered subsetting the data on the different categories (there are only 4), but then I am still lost with filtering by max date per client (because at least then, I could rbind() the results of each subset into a final data table). But alas, I am stuck.

Thanks in advance for the help.


Solution

  • I guess you can try

    df %>%
      group_by(ClientID, Category) %>%
      filter(Date == max(Date))
    

    which gives

      ClientID Category  Date
      <chr>    <chr>     <date>
    1 Person1  CategoryA 2020-09-30
    2 Person2  CategoryA 2020-08-31
    3 Person1  CategoryB 2020-09-14
    4 Person2  CategoryB 2020-10-10
    

    A base R option using subset + ave

    subset(
      df,
      Date == ave(Date, ClientID, Category, FUN = max)
    )
    

    gives

      ClientID  Category       Date
    2  Person1 CategoryA 2020-09-30
    4  Person2 CategoryA 2020-08-31
    6  Person1 CategoryB 2020-09-14
    8  Person2 CategoryB 2020-10-10
    

    A data.table option

    > setDT(df)[, lapply(.SD, max), by = .(ClientID, Category)]
       ClientID  Category       Date
    1:  Person1 CategoryA 2020-09-30
    2:  Person2 CategoryA 2020-08-31
    3:  Person1 CategoryB 2020-09-14
    4:  Person2 CategoryB 2020-10-10
    

    Data

    > dput(df)
    structure(list(ClientID = c("Person1", "Person1", "Person2", 
    "Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
    "CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
    "CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
    18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
    -8L), class = "data.frame")