Search code examples
rgroup-bydplyrsparklyr

How do I get back to the non-aggregated data after filtering out groups in dplyr within R?


Say I have data like this:

id = c(1,1,2,2,3,3)
date = as.Date(c('2018-01-02', '2018-01-03', '2017-07-01', '2018-01-02', '2017-08-02', '2017-08-03'))
df <- data.frame(id, date)

id date
1  2018-01-02
1  2018-01-03
2  2017-07-01
2  2018-01-02
3  2017-08-02
3  2017-08-03

I want to filter out all ids that don't have a date less than 2018-01-01. This is the table I want to get to:

id date
2  2017-07-01
2  2018-01-02
3  2017-08-02
3  2017-08-03

I can filter out the groups I don't want with this:

library(dplyr)
df %>% group_by(id) %>%
summarise(min_date = min(date)) %>%
filter(min_date <= as.Date('2018-01-01'))

But that gives me the aggregated results.

 id min_date    
  2 2017-07-01
  3 2017-08-02

What I really want is the original unaggregated data with id 1s removed.

I'm using sparklyr and dplyr.


Solution

  • You can use group_by %>% filter with an aggregated filter condition by group:

    df %>% group_by(id) %>% filter(any(date < '2018-01-01'))
    # note any(date < '2018-01-01') returns a boolean scalar for each group and determine whether
    # rows in the group should be kept or not
    
    # A tibble: 4 x 2
    # Groups:   id [2]
    #     id date      
    #  <dbl> <date>    
    #1     2 2017-07-01
    #2     2 2018-01-02
    #3     3 2017-08-02
    #4     3 2017-08-03