I am trying to count the number of individuals that appear in each year of my data set, starting with the latest year. My data set looks something like this:
id | grade | school year |
---|---|---|
1 | 9 | "17-18" |
1 | 10 | "18-19" |
1 | 11 | "19-20" |
1 | 12 | "20-21" |
2 | 9 | "17-18" |
2 | 10 | "18-19" |
2 | 11 | "20-21" |
3 | 11 | "20-21" |
3 | 12 | "21-22" |
4 | 9 | "17-18" |
4 | 10 | "18-19" |
4 | 11 | "19-20" |
4 | 12 | "20-21" |
I want to know the number of student IDs in the year 20-21 that also appear in the 19-20, and persist to appear in the year 18-19, and again appear again in 17-18. I would like counts for the number IDs in 20-21 surviving each of the previous years.
I need a table that looks like this:
# of IDs in 20-21 | #of IDs from 20-21 that were present in 19-20 | #of IDs from 20-21 that were present in 18-19 | #of IDs from 20-21 that were present in 17-18 |
---|---|---|---|
200 | 180 | 150 | 140 |
I have no clue how to even begin calculating this.
You could do it with a filter()
statement. In the code below, the filter is looking for groups where each of the four school year designations is present for a particular id
. It is doing this by, for example, "17-18 %in% school_year
, which looks for "17-18"
among the values of school_year
for each specific id
.
library(dplyr)
dat <- tibble::tribble(
~id, ~grade, ~school_year,
1, 9, "17-18",
1, 10, "18-19",
1, 11, "19-20",
1, 12, "20-21",
2, 9, "17-18",
2, 10, "18-19",
2, 11, "20-21",
3, 11, "20-21",
3, 12, "21-22",
4, 9, "17-18",
4, 10, "18-19",
4, 11, "19-20",
4, 12, "20-21")
Here is the data filtered down to include only the id
's that meet the condition.
filt_dat <- dat %>%
group_by(id) %>%
filter("17-18" %in% school_year & "18-19" %in% school_year
& "19-20" %in% school_year & "20-21" %in% school_year)
filt_dat
#> # A tibble: 8 × 3
#> # Groups: id [2]
#> id grade school_year
#> <dbl> <dbl> <chr>
#> 1 1 9 17-18
#> 2 1 10 18-19
#> 3 1 11 19-20
#> 4 1 12 20-21
#> 5 4 9 17-18
#> 6 4 10 18-19
#> 7 4 11 19-20
#> 8 4 12 20-21
If you just want the unique individual id
values, you could do this:
filt_dat %>%
ungroup %>%
select(id) %>%
pull %>%
unique
#> [1] 1 4
If an observation has to be in 19-20 to be counted in 18-19, then you can figure out whether ids are in each year, pivot the data longer and then take the cumulative product over the in-year values.
dat %>%
group_by(id) %>%
summarise(in_20_21 = "20-21" %in% school_year,
in_19_20 = "19-20" %in% school_year,
in_18_19 = "18-19" %in% school_year,
in_17_18 = "17-18" %in% school_year) %>%
pivot_longer(-id, names_to="yr", values_to="surv") %>%
group_by(id) %>%
mutate(surv = cumprod(surv)) %>%
group_by(yr) %>%
summarise(n=sum(surv))
#> # A tibble: 4 × 2
#> yr n
#> <chr> <dbl>
#> 1 in_17_18 2
#> 2 in_18_19 2
#> 3 in_19_20 2
#> 4 in_20_21 4
If an observation doesn't have to be in 19-20 to be counted in 18-19, then you could filter by the observations who are in 20-21, and then just count the observations in each year:
dat %>%
group_by(id) %>%
summarise(in_20_21 = "20-21" %in% school_year,
in_19_20 = "19-20" %in% school_year,
in_18_19 = "18-19" %in% school_year,
in_17_18 = "17-18" %in% school_year) %>%
filter(in_20_21) %>%
pivot_longer(-id, names_to="yr", values_to="surv") %>%
group_by(yr) %>%
summarise(n=sum(surv))
#> # A tibble: 4 × 2
#> yr n
#> <chr> <int>
#> 1 in_17_18 3
#> 2 in_18_19 3
#> 3 in_19_20 2
#> 4 in_20_21 4
Created on 2023-03-20 with reprex v2.0.2