I have a data set in R that looks something link this:
ClientID | Category | Date |
---|---|---|
Person1 | CategoryA | 2020-09-01 |
Person1 | CategoryA | 2020-09-30 |
Person2 | CategoryA | 2020-07-25 |
Person2 | CategoryA | 2020-08-31 |
Person1 | CategoryB | 2020-03-15 |
Person1 | CategoryB | 2020-09-14 |
Person2 | CategoryB | 2020-06-17 |
Person2 | CategoryB | 2020-10-10 |
What I would like to do is filter it to only return the row of the most recent date in each category for every client, so it would look like this:
ClientID | Category | Date |
---|---|---|
Person1 | CategoryA | 2020-09-30 |
Person1 | CategoryB | 2020-09-14 |
Person2 | CategoryA | 2020-08-31 |
Person2 | CategoryB | 2020-10-10 |
I have tried
library(plyr)
dataSet %>% filter(Category == "CategoryA",Date == max(Date))
I knew this wouldn't work the moment I typed it, but I am lost on where to go. I have considered subsetting the data on the different categories (there are only 4), but then I am still lost with filtering by max date per client (because at least then, I could rbind()
the results of each subset into a final data table). But alas, I am stuck.
Thanks in advance for the help.
I guess you can try
df %>%
group_by(ClientID, Category) %>%
filter(Date == max(Date))
which gives
ClientID Category Date
<chr> <chr> <date>
1 Person1 CategoryA 2020-09-30
2 Person2 CategoryA 2020-08-31
3 Person1 CategoryB 2020-09-14
4 Person2 CategoryB 2020-10-10
A base R option using subset
+ ave
subset(
df,
Date == ave(Date, ClientID, Category, FUN = max)
)
gives
ClientID Category Date
2 Person1 CategoryA 2020-09-30
4 Person2 CategoryA 2020-08-31
6 Person1 CategoryB 2020-09-14
8 Person2 CategoryB 2020-10-10
A data.table
option
> setDT(df)[, lapply(.SD, max), by = .(ClientID, Category)]
ClientID Category Date
1: Person1 CategoryA 2020-09-30
2: Person2 CategoryA 2020-08-31
3: Person1 CategoryB 2020-09-14
4: Person2 CategoryB 2020-10-10
Data
> dput(df)
structure(list(ClientID = c("Person1", "Person1", "Person2",
"Person2", "Person1", "Person1", "Person2", "Person2"), Category = c("CategoryA",
"CategoryA", "CategoryA", "CategoryA", "CategoryB", "CategoryB",
"CategoryB", "CategoryB"), Date = structure(c(18506, 18535, 18468,
18505, 18336, 18519, 18430, 18545), class = "Date")), row.names = c(NA,
-8L), class = "data.frame")