Tags: r, dplyr, unique

Remove IDs with fewer than 9 unique observations


I am trying to filter my data and remove IDs that have fewer than 9 unique month observations. I would also like to create a list of IDs that shows each ID's count of unique months.

I've tried using a few different options:

library(dplyr)
count <- bind %>% group_by(IDS) %>% filter(n(data.month)>= 9) %>%       ungroup()
count2 <- subset(bind, with(bind, IDS %in% names(which(table(data.month)>=9))))

Neither of these worked.

This is what my data looks like:

   data.month   ID
           01    2
           02    2
           03    2
           04    2
           05    2
           05    2
           06    2
           06    2
           07    2
           07    2
           07    2
           07    2
           07    2
           08    2
           09    2
           10    2
           11    2
           12    2
           01    5
           01    5
           02    5
           01    7
           01    7
           01    7
           01    4
           02    4
           03    4
           04    4
           05    4
           05    4
           06    4
           06    4
           07    4
           07    4
           07    4
           07    4
           07    4
           08    4
           09    4
           10    4
           11    4
           12    4

In the end, I would like this:

IDs
2
4

I would also like this:

IDs  Count
2     12
5     2
7     1
4     12

So far this code is the closest, but it still produces an error:

count <- bind %>%
  group_by(IDs) %>% 
  filter(length(unique(bind$data.month >=9)))

Error in filter_impl(.data, quo) : Argument 2 filter condition does not evaluate to a logical vector


Solution

  • We can use n_distinct(), which counts the unique values of a column within each group.

    To remove IDs with fewer than 9 unique months and list the IDs that remain:

    library(dplyr)
    
    df %>%
      group_by(ID) %>%
      filter(n_distinct(data.month) >= 9) %>%
      pull(ID) %>% unique
    
    #[1] 2 4
    

    Or, to return the remaining IDs as a tibble instead of a vector:

    df %>%
      group_by(ID) %>%
      filter(n_distinct(data.month) >= 9) %>%
      distinct(ID)
    
    #     ID
    #  <int>
    #1     2
    #2     4
    

    To count the unique months for each ID:

    df %>%
      group_by(ID) %>%
      summarise(count = n_distinct(data.month))
    
    
    #     ID count
    #   <int> <int>
    #1     2    12
    #2     4    12
    #3     5     2
    #4     7     1
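
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data from the question and produces both results from a single summary. The object names `counts`, `keep_ids`, and `filtered` are illustrative and do not appear in the original post, and storing the months as numbers rather than zero-padded strings is an assumption. It also sidesteps the two problems in the question's attempts: `n()` takes no arguments, and `length(unique(...))` returns a single number rather than the logical vector that `filter()` expects (referring to `bind$data.month` also bypasses the grouping).

```r
library(dplyr)

# Rebuild the sample data from the question.
# Assumption: months are stored as numbers, not as zero-padded strings.
df <- data.frame(
  data.month = c(1:5, 5, 6, 6, rep(7, 5), 8:12,  # ID 2
                 1, 1, 2,                         # ID 5
                 1, 1, 1,                         # ID 7
                 1:5, 5, 6, 6, rep(7, 5), 8:12),  # ID 4
  ID = c(rep(2, 18), rep(5, 3), rep(7, 3), rep(4, 18))
)

# Unique months per ID; repeated months within an ID are counted once.
counts <- df %>%
  group_by(ID) %>%
  summarise(count = n_distinct(data.month))

# IDs that have at least 9 unique months.
keep_ids <- counts %>%
  filter(count >= 9) %>%
  pull(ID)

# Original rows, restricted to the retained IDs.
filtered <- df %>%
  filter(ID %in% keep_ids)

print(counts)
print(keep_ids)
```

Computing the counts once and filtering on them keeps the threshold in a single place, which may be convenient if the cutoff of 9 changes later.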