Search code examples
rdropout

How to create a CUMULATIVE dropout rate table from raw data


I'm trying to modify a solution posted here Create cohort dropout rate table from raw data

I'd like to create a CUMULATIVE dropout rate table using these data.

DT<-data.table(
id =c (1,2,3,4,5,6,7,8,9,10,
     11,12,13,14,15,16,17,18,19,20,
     21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
year =c (2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
       2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
   2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
cohort =c(1,1,1,1,1,1,1,1,1,1,
        2,2,2,1,1,2,1,2,1,2,
        1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))

So far, I've been able to get to this point

     library(tidyverse)

DT %>% 
  group_by(year) %>% 
  count(cohort) %>% 
  ungroup() %>% 
  spread(year, n) %>% 
  mutate(y2014_2015_dropouts = (`2014` - `2015`),
         y2015_2016_dropouts =  (`2015` - `2016`)) %>% 
  mutate(y2014_2015_cumulative =y2014_2015_dropouts/`2014`,
         y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative)%>%


  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>% 
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts, y2014_2015_cumulative,y2015_2016_cumulative )

A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.

     # A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA  
3      3                   0                   0                  NA                    NA  
>  

The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.

I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.


Solution

  • Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:

    Organize and order by cohort and year:

    DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]
    

    Assign the year range:

    DT2[, year := paste(lag(year), year, sep = "_"),]
    

    Get dropouts per year

    DT2[, dropouts := ifelse(!is.na(lag(N)), lag(N) - N, 0), , cohort, ]
    

    Get the cumulative sum of proportion dropped out each year per cohort:

    DT2[, cumul := cumsum(dropouts) / max(N), cohort]
    

    Output:

    > DT2
       cohort      year  N dropouts     cumul
    1:      1   NA_2014 10        0 0.0000000
    2:      1 2014_2015  4        6 0.6000000
    3:      1 2015_2016  2        2 0.8000000
    4:      2 2016_2015  6        0 0.0000000
    5:      2 2015_2016  4        2 0.3333333
    6:      3 2016_2016  9        0 0.0000000