I have a data set that looks like this but is significantly larger, where the column pit are different values but occasionally the same:
Date | pit |
---|---|
11/23/21 | 456256 |
10/13/20 | 332874 |
11/23/21 | 456111 |
10/13/20 | 332845 |
1/3/19 | 456256 |
10/13/20 | 332874 |
11/23/21 | 456220 |
10/13/20 | 652874 |
1/3/19 | 456874 |
I want to find the amount of unique values for "pit" and then group them by year. I've tried the code:
df %>%
mutate(DM = format(Date, "%Y")) %>%
group_by(DM) %>%
summarise(pit = n_distinct(pit))
Which gives the output:
DM pit
<chr> <int>
1 2020 169
2 2021 427
3 2022 316
4 2023 10
But if I run:
n_distinct(df$pit)
It gives the output:
[1] 901
If you you take the sum of the first output (169+427+316+10) it is equal to 922, which is larger than the output for n_distinct(df$pit)
, which is 901. I'm assuming this above code is counting unique values inside each year, which accounts for a greater total.
How can I count unique values and then sum them into the year of their first observation? This should result in a total that would match the correct number of unique values.
How can I count unique values and then sum them into the year of their first observation?
First filter so each value only shows up in its first year, then count unique values.
df %>%
mutate(DM = format(Date, "%Y")) %>%
group_by(pit) %>%
slice_min(DM, with_ties = FALSE) %>%
ungroup() %>%
group_by(DM) %>%
summarise(pit = n_distinct(pit))
# # A tibble: 3 × 2
# DM pit
# <chr> <int>
# 1 2019 2
# 2 2020 3
# 3 2021 2
Using this sample data:
df = read.table(text = 'Date pit
11/23/21 456256
10/13/20 332874
11/23/21 456111
10/13/20 332845
1/3/19 456256
10/13/20 332874
11/23/21 456220
10/13/20 652874
1/3/19 456874', header = T) %>%
mutate(Date = lubridate::mdy(Date))