Search code examples
rdplyrsurvival-analysis

Count surviving individuals across time in r


I am trying to count the number of individuals that appear in each year of my data set, starting with the latest year. My data set looks something like this:

id grade school year
1 9 "17-18"
1 10 "18-19"
1 11 "19-20"
1 12 "20-21"
2 9 "17-18"
2 10 "18-19"
2 11 "20-21"
3 11 "20-21"
3 12 "21-22"
4 9 "17-18"
4 10 "18-19"
4 11 "19-20"
4 12 "20-21"

I want to know the number of student IDs in the year 20-21 that also appear in the 19-20, and persist to appear in the year 18-19, and again appear again in 17-18. I would like counts for the number IDs in 20-21 surviving each of the previous years.

I need a table that looks like this:

# of IDs in 20-21 #of IDs from 20-21 that were present in 19-20 #of IDs from 20-21 that were present in 18-19 #of IDs from 20-21 that were present in 17-18
200 180 150 140

I have no clue how to even begin calculating this.


Solution

  • You could do it with a filter() statement. In the code below, the filter is looking for groups where each of the four school year designations is present for a particular id. It is doing this by, for example, "17-18 %in% school_year, which looks for "17-18" among the values of school_year for each specific id.

    library(dplyr)
    dat <- tibble::tribble(
      ~id,  ~grade, ~school_year,
    1,  9,  "17-18",
    1,  10, "18-19",
    1,  11, "19-20",
    1,  12, "20-21",
    2,  9,  "17-18",
    2,  10, "18-19",
    2,  11, "20-21",
    3,  11, "20-21",
    3,  12, "21-22",
    4,  9,  "17-18",
    4,  10, "18-19",
    4,  11, "19-20",
    4,  12, "20-21")
    

    Here is the data filtered down to include only the id's that meet the condition.

    filt_dat <- dat %>% 
      group_by(id) %>% 
      filter("17-18" %in% school_year & "18-19" %in% school_year 
             & "19-20" %in% school_year & "20-21" %in% school_year) 
    filt_dat
    #> # A tibble: 8 × 3
    #> # Groups:   id [2]
    #>      id grade school_year
    #>   <dbl> <dbl> <chr>      
    #> 1     1     9 17-18      
    #> 2     1    10 18-19      
    #> 3     1    11 19-20      
    #> 4     1    12 20-21      
    #> 5     4     9 17-18      
    #> 6     4    10 18-19      
    #> 7     4    11 19-20      
    #> 8     4    12 20-21
    

    If you just want the unique individual id values, you could do this:

    filt_dat %>% 
      ungroup %>% 
      select(id) %>% 
      pull %>% 
      unique
    #> [1] 1 4
    

    Edit: Answer to updated question

    If an observation has to be in 19-20 to be counted in 18-19, then you can figure out whether ids are in each year, pivot the data longer and then take the cumulative product over the in-year values.

    dat %>% 
      group_by(id) %>% 
      summarise(in_20_21 = "20-21" %in% school_year, 
                in_19_20 = "19-20" %in% school_year, 
                in_18_19 = "18-19" %in% school_year, 
                in_17_18 = "17-18" %in% school_year) %>% 
      pivot_longer(-id, names_to="yr", values_to="surv") %>% 
      group_by(id) %>% 
      mutate(surv = cumprod(surv)) %>% 
      group_by(yr) %>% 
      summarise(n=sum(surv))
    #> # A tibble: 4 × 2
    #>   yr           n
    #>   <chr>    <dbl>
    #> 1 in_17_18     2
    #> 2 in_18_19     2
    #> 3 in_19_20     2
    #> 4 in_20_21     4
    

    If an observation doesn't have to be in 19-20 to be counted in 18-19, then you could filter by the observations who are in 20-21, and then just count the observations in each year:

    dat %>% 
      group_by(id) %>% 
      summarise(in_20_21 = "20-21" %in% school_year, 
                in_19_20 = "19-20" %in% school_year, 
                in_18_19 = "18-19" %in% school_year, 
                in_17_18 = "17-18" %in% school_year) %>% 
      filter(in_20_21) %>% 
      pivot_longer(-id, names_to="yr", values_to="surv") %>% 
      group_by(yr) %>% 
      summarise(n=sum(surv))
    #> # A tibble: 4 × 2
    #>   yr           n
    #>   <chr>    <int>
    #> 1 in_17_18     3
    #> 2 in_18_19     3
    #> 3 in_19_20     2
    #> 4 in_20_21     4
    

    Created on 2023-03-20 with reprex v2.0.2