Search code examples
rfor-loopdata.tablesimulationpolicy

Using data.table in r to eliminate inner for loop


I have an inner for-loop in R which I have identified as significant bottleneck in my code. The script simulates the effect of a time-varying policy on individuals prior to adulthood. The outer loop runs over a list of cohorts (yob = 1910,...,1930 etc.) that I would like to study. The inner loop counts from ages from a = 5 to a = 17. CSL.details is a data.table that contains the details of each law that I am studying in form of the variables I grab, which vary by year = birthyear + a. To understand the overall effects of the policy by birth cohort, I need to track ca_years1, ca_years2, ca_years3, and ca_years4 for each a.

ages = seq.int(5,17)
state = "Massachusetts"
yob = seq.int(1910, 1930)
for (birthyear in yob){
    ca_years1 = 0; ca_years2 = 0; ca_years3 = 0; ca_years4 = 0;
    for (a in ages){
      thisyear = birthyear + a
      # Grab each law for given state and year and implement exemption permit
      thislaw <- CSL.details[statename == state & yob == birthyear & thisyear == year]
      if (nrow(thislaw) == 0) next
      exempt_workpermit = (ca_years2 >= thislaw$workyrs & a >= thislaw$workage & thislaw$workage > 0)
      exempt_yearstodropout = (ca_years3 >= thislaw$earlyyrs & a >= thislaw$earlyyrs_condition & thislaw$earlyyrs > 0)
      exempt_cont = ((ca_years2 + ca_years4) >= thislaw$contyrs & thislaw$contyrs > 0)
      # Increment each law when school is required
      if(thislaw$entryage <= a & a < thislaw$exitage){
        ca_years1 = ca_years1 + 1
        if(!exempt_workpermit){ca_years2 = ca_years2 + 1}
        if(!exempt_yearstodropout){ca_years3 = ca_years3 + 1}
      }
      if(thislaw$contage > a & 
         a >= thislaw$workage & 
         !exempt_cont & 
         thislaw$workage > 0 &
         !(thislaw$entryage <= a & a < thislaw$exitage & !exempt_workpermit)
      ){ca_years4 = ca_years4 + 1}
      
    }
    CSL.exposures[statename == state & yob == birthyear]$ca_years1 = ca_years1
    CSL.exposures[statename == state & yob == birthyear]$ca_years2 = ca_years2
    CSL.exposures[statename == state & yob == birthyear]$ca_years3 = ca_years3
    CSL.exposures[statename == state & yob == birthyear]$ca_years4 = ca_years4
  }

Is there a data.table solution for replacing the inner-loop? I am an intermediate R coder and it is a bit difficult to think of how to get started. Although I would prefer data.table exclusively, I am open to dplyr-type solutions if they significantly speed up the code.

Edit: here is an example of what CSL.detail looks like, as a copy-pasted data.table.

statename year  yob statefip entryage exitage earlyyrs earlyyrs_condition workage workyrs contage contyrs statecompschoolyr
    1: Massachusetts 1913 1800       25        7      16        4                 14      14       4      16       0              1852
    2: Massachusetts 1913 1801       25        7      16        4                 14      14       4      16       0              1852
    3: Massachusetts 1913 1802       25        7      16        4                 14      14       4      16       0              1852
    4: Massachusetts 1913 1803       25        7      16        4                 14      14       4      16       0              1852
    5: Massachusetts 1913 1804       25        7      16        4                 14      14       4      16       0              1852

Solution

  • I managed to refactor the code to solve the problem. The key idea is to exploit state and yob as grouping variables (since all calculations happen within a state and yob pair). This completely eliminates the outer loops and requires only a single loop, iterating by age. I am just saving this answer here for reference, but I am not sure that there is a broader lesson for the stackoverflow.com community so feel free to delete. The time savings are on the order of 95%, primarily because it reduces the overhead time to call data.table.

    for(a in ages){
      # grab running total of years of education compelled by state and year of birth
      CSL.details[CSL.exposures, on = .(statename, yob), 
                  `:=` (ca_years1 = i.ca_years1,
                        ca_years2 = i.ca_years2,
                        ca_years3 = i.ca_years3,
                        ca_years4 = i.ca_years4)] %>%
        .[year == a + yob, 
          `:=`(
            # create exemptions by age based on number of years of schooling completed
            exempt_workpermit = (ca_years2 >= workyrs & a >= workage & workage > 0),
            exempt_yearstodropout = (ca_years3 >= earlyyrs & a >= earlyyrs_condition & earlyyrs > 0),
            exempt_cont = ((ca_years2 + ca_years4) >= contyrs & contyrs > 0)
          ), by = .(statename, yob)]
      
      CSL.exposures[
        CSL.details[year == a + yob], on = .(yob, statename),
        `:=` (exempt_workpermit = i.exempt_workpermit, exempt_yearstodropout = i.exempt_yearstodropout, 
              exempt_cont = i.exempt_cont, entryage  = i.entryage, 
              exitage = i.exitage, contage = i.contage, workage = i.workage) ] %>%
        .[ , 
           `:=` (
             ca_years1 = 
               fifelse(entryage <= a & a < exitage, 
                       ca_years1 + 1, ca_years1, na = as.numeric(ca_years1)),
             ca_years2 = 
               fifelse(entryage <= a & a < exitage & !exempt_workpermit, 
                       ca_years2 + 1, ca_years2, na = as.numeric(ca_years2)),
             ca_years3 = 
               fifelse(entryage <= a & a < exitage & !exempt_yearstodropout, 
                       ca_years3 + 1, ca_years3, na = as.numeric(ca_years3)),
             ca_years4 = 
               fifelse(contage > a & a >= workage & !exempt_cont & 
                         workage > 0 &
                         !(entryage <= a & a < exitage & !exempt_workpermit),
                       ca_years4 + 1, ca_years4, na = as.numeric(ca_years4))),
           by = .(statename, yob)
        ]
    }