Say I have data like this:
id = c(1,1,2,2,3,3)
date = as.Date(c('2018-01-02', '2018-01-03', '2017-07-01', '2018-01-02', '2017-08-02', '2017-08-03'))
df <- data.frame(id, date)
id date
1 2018-01-02
1 2018-01-03
2 2017-07-01
2 2018-01-02
3 2017-08-02
3 2017-08-03
I want to filter out all ids that don't have a date less than 2018-01-01. This is the table I want to get to:
id date
2 2017-07-01
2 2018-01-02
3 2017-08-02
3 2017-08-03
I can filter out the groups I don't want with this:
library(dplyr)
df %>% group_by(id) %>%
summarise(min_date = min(date)) %>%
filter(min_date <= as.Date('2018-01-01'))
But that gives me the aggregated results.
id min_date
2 2017-07-01
3 2017-08-02
What I really want is the original unaggregated data with id 1s removed.
I'm using sparklyr and dplyr.
You can use group_by %>% filter
with an aggregated filter condition by group:
df %>% group_by(id) %>% filter(any(date < '2018-01-01'))
# note any(date < '2018-01-01') returns a boolean scalar for each group and determine whether
# rows in the group should be kept or not
# A tibble: 4 x 2
# Groups: id [2]
# id date
# <dbl> <date>
#1 2 2017-07-01
#2 2 2018-01-02
#3 3 2017-08-02
#4 3 2017-08-03