Search code examples
rdplyrcountunique

Count unique values for entire data set group by years


I have a data set that looks like this but is significantly larger, where the column pit are different values but occasionally the same:

Date pit
11/23/21 456256
10/13/20 332874
11/23/21 456111
10/13/20 332845
1/3/19 456256
10/13/20 332874
11/23/21 456220
10/13/20 652874
1/3/19 456874

I want to find the amount of unique values for "pit" and then group them by year. I've tried the code:

df %>% 
  mutate(DM = format(Date, "%Y")) %>%
  group_by(DM) %>% 
  summarise(pit = n_distinct(pit)) 

Which gives the output:

  DM      pit
  <chr> <int>
1 2020    169
2 2021    427
3 2022    316
4 2023     10

But if I run:

n_distinct(df$pit)

It gives the output:

[1] 901

If you you take the sum of the first output (169+427+316+10) it is equal to 922, which is larger than the output for n_distinct(df$pit), which is 901. I'm assuming this above code is counting unique values inside each year, which accounts for a greater total.

How can I count unique values and then sum them into the year of their first observation? This should result in a total that would match the correct number of unique values.


Solution

  • How can I count unique values and then sum them into the year of their first observation?

    First filter so each value only shows up in its first year, then count unique values.

    df %>% 
      mutate(DM = format(Date, "%Y")) %>%
      group_by(pit) %>%
      slice_min(DM, with_ties = FALSE) %>%
      ungroup() %>%
      group_by(DM) %>% 
      summarise(pit = n_distinct(pit)) 
    # # A tibble: 3 × 2
    #   DM      pit
    #   <chr> <int>
    # 1 2019      2
    # 2 2020      3
    # 3 2021      2
    

    Using this sample data:

    df = read.table(text = 'Date    pit
    11/23/21    456256
    10/13/20    332874
    11/23/21    456111
    10/13/20    332845
    1/3/19  456256
    10/13/20    332874
    11/23/21    456220
    10/13/20    652874
    1/3/19  456874', header = T) %>%
      mutate(Date = lubridate::mdy(Date))