Search code examples
rdplyrdata.tabletidyverseplyr

Efficiently Repeating Observations by Group


I am trying to find an efficient way to repeat rows by group in data.table for only some groups. Please consider the following example:

library(data.table) 

DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
                 y = 1:8)

This dataset looks like:

head(DT)
   x y
1: A 1
2: A 2
3: B 3
4: B 4
5: C 5
6: C 6

Say I have a separate vector rep <- c("A", "A", "A", "B", "B", "C"). Given this vector, I want to be able to repeat all rows of A three times (due to the cardinality of the "A" characters in rep) and all rows associated with B two times. Thus, the final dataset should like:

    x y
1:  A 1
2:  A 2
3:  A 1
4:  A 2
5:  A 1
6:  A 2
7:  B 3
8:  B 4
9:  B 3
10: B 4
11: C 5
12: C 6

Notice that I did not repeat "C" because the cardinality of "C" is only 1 in rep. I have a hackish way of doing this procedure at the moment, but I'm wondering if there was a more efficient data.table way of doing the above.

Thank you!

P.S. The reason I am doing this is because I am doing some matching with replacement in my regressions and sometimes, the same control firm is assigned to more than one treatment firm.


Solution

  • A data.table merge won't give you the same ordering but you aren't supposed to rely on ordering in datatables, anyway:

    merge(DT, data.frame(x=rep), by="x")
    
        x y
     1: A 1
     2: A 1
     3: A 1
     4: A 2
     5: A 2
     6: A 2
     7: B 3
     8: B 3
     9: B 4
    10: B 4
    11: C 5
    12: C 6