Search code examples
rdata.tablepanel-data

How to match panel data based on year pattern?


I want to select individuals from a control pool to act as a control for a treated case. Individuals for the control are selected if their activity/inactivity in a year matches the treated's pattern of activity.

Eg. A treated case has: activity in 2009, 2011 but no activity in 2010 (so no record for that year). This treated's control should only be made up of individuals who also had activity in 2009 and 2011 and no activity in 2010.

I'm only looking at 3-year blocks in the treated activity pattern. So from the maximum year where there is activity to 2 years prior.

If I have separate data.tables for the treated and the control pool, how would I go about matching them?

Treated would look like:

treated <- data.table(id = c(1, 1, 1, 2, 2, 3, 3, 4),
                      yr = c(2010, 2011, 2012, 2011, 2012, 2010, 2012, 2013))
id     yr
1   2010            
1   2011            
1   2012            
2   2011            
2   2012            
3   2010            
3   2012            
4   2013

Control

control <- data.table(id = c(rep(5, 6), rep(6, 3), 7, 7, 8, 8, 8),
                      yr = c(2009:2014, 2011, 2012, 2013, 2010, 2012, 2009, 2013, 2014))
id     yr
5   2009            
5   2010            
5   2011            
5   2012            
5   2013            
5   2014            
6   2011            
6   2012            
6   2013            
7   2010    
7   2012            
8   2009
8   2013    
8   2014    

I've tried putting both in wide format so that NAs could be produced for the missing years but couldn't figure out how to match from there.

dcast(treated, id ~ yr, value.var = "yr")

id    2010    2011    2012    2013
-----------------------------------
1     2010    2011    2012     NA
2     NA      2011    2012     NA
3     2010    NA      2012     NA
4     NA      NA      NA       2013

So the activity/inactivity pattern for each of the treated would be:

id   pattern
=====================
1  2010  2011  2012
2  NA    2011  2012
3  2010  NA    2012
4  NA    NA    2013
dcast(control, id ~ yr, value.var = "yr")

id    2009    2010    2011    2012    2013    2014
----------------------------------------------------
5     2009    2010    2011    2012    2013    2014
6     NA      NA      2011    2012    2013    NA
7     NA      2010    NA      2012    NA      NA
8     2009    NA      NA      NA      2013    2014

So 1 should match with 5
2 with 6
3 with 7
4 with 8

Can anyone point me in the right direction?

(Not sure about the best way to handle this but right now I have a function that takes a treated case, selects individuals for the control and calculates effects. So I would be happy with a list of control ids that match a particular treated case. Then I'd use those ids to subset the main control data.table.)

For expected output, say the matching method was in this function:

get_control_ids <- function(treated_id){
  ...
  return(vector_of_control_ids)
}

Then running this function on the id 1

get_control_ids(1)

would result in a vector containing the control ids that match it. So using my tiny control pool, the function would return a vector containing just 5.

Edit: I'm not sure what the output should look like. So any tips on that would be handy too. Maybe a data.table like:

treated_id    control_ids
-----------------------------------
1               5, 10, 13
2               6, 22, 23

For more context, I have a calculate_effects(treated_key) function which works in this way:

I have 3 data.tables:
treated - has ids repeated with each year of activity (plus other columns)
treated_keys - has unique ids (in my actual data I have two columns to identify a particular treated)
control

I use my calculate_effects(treated_key) function like so:

results <- treated_keys[, calculate_effects(.SD), by = 1:nrow(treated_keys)]

Within the function, treated is subsetted using the particular treated_key which will extract all records belonging to that particular treated case.

Also within the function, control is subsetted according to some matching rules, and then used in calculations.

I would like to add code within my calculate_effects(treated_key) which will correctly select the control for that particular treated_key.


Solution

  • Assuming 3 yearly blocks, you can do the following:

    #expand treated to fill in gap years if any
    exptrt <- treated[, .(yr=seq(max(yr)-2L, max(yr))), by=.(id)][,
        att := 0L][
            treated, att := 1, on=.(id, yr)]
    
    #pivot control into id against yr
    pctrl <- dcast(control, id ~ yr, length, value.var = "yr")
    
    #for each id, pivot treated and join with control using 
    #whatever years are in treated incl gap years
    exptrt[, cid := id]
    exptrt[, {
        ptrt <- dcast(.SD, cid ~ yr, value.var="att")
    
        pctrl[ptrt, on=names(ptrt)[-1L], x.id]
    }, by=.(id)]
    

    output:

       id V1
    1:  1  5
    2:  2  6
    3:  3  7
    4:  4  8