I want to estimate the effect of treatment X on outcome Y by matching treated and control groups so that their covariates are balanced, using R and the MatchIt package.
I'm compiling a retrospective cohort, and the treatment time varies across the treated cases. Moreover, I have multiple covariates (COV_A, COV_B...) that depend on the treatment time. I use a large database to mine controls and query the time-dependent covariates for a given treatment time. This is a large sample with thousands of treated cases, tens of thousands of potential controls, and many covariates.
To achieve this, I used an SQL query to manually perform an "exact match" on some of the covariates as a kind of "initial matching" (for example, checking which controls have been monitored long enough to be treated at a given time). This initial step resulted in a table with multiple rows of potential control cases for each treated case (TREAT_ID). For each potential control row, I mined the time-dependent covariates as of the treated case's treatment time.
The result is a table of potential controls stratified by treated case. This means that a control case can appear more than once, with a different or the same treatment time, and its covariates change accordingly.
My intention is to use the matchit function to perform distance matching within each stratum, using method = "nearest" and exact = "TREAT_ID", for example:
CONTROL_ID | TREAT_ID | TREATMENT_TIME | COV_A | COV_B |
---|---|---|---|---|
C-1 | T-1 | 1.5 | 0.6 | 185 |
C-2 | T-1 | 1.5 | 0.7 | 123 |
C-3 | T-1 | 1.5 | 0.8 | 182 |
C-4 | T-1 | 1.5 | 0.6 | 185 |
C-1 | T-2 | 2.2 | 0.9 | 160 |
C-2 | T-2 | 2.2 | 1.4 | 150 |
C-5 | T-2 | 2.2 | 0.9 | 48 |
C-6 | T-2 | 2.2 | 3.3 | 113 |
* Notice that controls C-1 and C-2 each appear twice, once per stratum...
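For reference, the example table can be reproduced in R, and the repeated controls listed, with something like the sketch below. It uses base R only, and the object names (`controls`, `n_strata`, `repeated`) are just for illustration.

```r
# Example table of stratified potential controls (values copied from the table above).
controls <- data.frame(
  CONTROL_ID     = c("C-1", "C-2", "C-3", "C-4", "C-1", "C-2", "C-5", "C-6"),
  TREAT_ID       = rep(c("T-1", "T-2"), each = 4),
  TREATMENT_TIME = rep(c(1.5, 2.2), each = 4),
  COV_A          = c(0.6, 0.7, 0.8, 0.6, 0.9, 1.4, 0.9, 3.3),
  COV_B          = c(185, 123, 182, 185, 160, 150, 48, 113)
)

# Count how many strata each control appears in.
n_strata <- table(controls$CONTROL_ID)

# Controls that are candidates for more than one treated case.
repeated <- names(n_strata[n_strata > 1])
print(repeated)

# Their rows, to see how the covariates differ between strata.
print(controls[controls$CONTROL_ID %in% repeated, ])
```

These repeated rows are the ones that could be picked for two different treated cases if each stratum were matched on its own.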
I want to do matching "without replacement" (each control unit is matched to only one treated unit). How can I achieve this if the initial table contains duplicates of the same control cases (some with different covariate values)?
I also want to be able to:
(Maybe my whole approach to the problem is wrong; I'd also be happy to hear different solutions...)
TL;DR: I used @Noah's suggestion and the unit.id argument.
I combined the treated cases with the stratified control cases from the example in the question and added the MATCHING_STRATA and MATCHING_CASE columns:
ID | MATCHING_STRATA | MATCHING_CASE | TREATMENT_TIME | COV_A | COV_B |
---|---|---|---|---|---|
T-1 | T-1 | TREATED | 1.5 | 1.2 | 112 |
C-1 | T-1 | CONTROL | 1.5 | 0.6 | 185 |
C-2 | T-1 | CONTROL | 1.5 | 0.7 | 123 |
C-3 | T-1 | CONTROL | 1.5 | 0.8 | 182 |
C-4 | T-1 | CONTROL | 1.5 | 0.6 | 185 |
T-2 | T-2 | TREATED | 2.2 | 1.6 | 140 |
C-1 | T-2 | CONTROL | 2.2 | 0.9 | 160 |
C-2 | T-2 | CONTROL | 2.2 | 1.4 | 150 |
C-5 | T-2 | CONTROL | 2.2 | 0.9 | 48 |
C-6 | T-2 | CONTROL | 2.2 | 3.3 | 113 |
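In R, that combining step could look roughly like the sketch below. It is base R only; `controls`, `treated`, `control_rows` and `treated_rows` are made-up names for illustration, and the values are the ones from the example tables.

```r
# Stratified potential controls: one row per control per stratum.
controls <- data.frame(
  CONTROL_ID     = c("C-1", "C-2", "C-3", "C-4", "C-1", "C-2", "C-5", "C-6"),
  TREAT_ID       = rep(c("T-1", "T-2"), each = 4),
  TREATMENT_TIME = rep(c(1.5, 2.2), each = 4),
  COV_A          = c(0.6, 0.7, 0.8, 0.6, 0.9, 1.4, 0.9, 3.3),
  COV_B          = c(185, 123, 182, 185, 160, 150, 48, 113)
)

# Treated cases, with covariates taken at their own treatment time.
treated <- data.frame(
  TREAT_ID       = c("T-1", "T-2"),
  TREATMENT_TIME = c(1.5, 2.2),
  COV_A          = c(1.2, 1.6),
  COV_B          = c(112, 140)
)

# Give both tables the same layout: unit ID, stratum, and treated/control label.
control_rows <- data.frame(
  ID              = controls$CONTROL_ID,
  MATCHING_STRATA = controls$TREAT_ID,
  MATCHING_CASE   = "CONTROL",
  controls[c("TREATMENT_TIME", "COV_A", "COV_B")]
)
treated_rows <- data.frame(
  ID              = treated$TREAT_ID,
  MATCHING_STRATA = treated$TREAT_ID,  # each treated case defines its own stratum
  MATCHING_CASE   = "TREATED",
  treated[c("TREATMENT_TIME", "COV_A", "COV_B")]
)

# Stack them and sort so each treated case is followed by its candidate controls.
df <- rbind(treated_rows, control_rows)
df <- df[order(df$MATCHING_STRATA, df$MATCHING_CASE == "CONTROL"), ]
rownames(df) <- NULL
print(df)
```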
I then used the matchit function with exact = "MATCHING_STRATA" to match within each stratum separately and unit.id = "ID" to enforce matching without replacement across all strata:
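    MatchIt::matchit(MATCHING_CASE ~ COV_A + COV_B,
                     data = df,
                     method = "nearest",
                     exact = "MATCHING_STRATA",  # match only within each stratum
                     unit.id = "ID",             # rows sharing an ID count as one unit
                     replace = FALSE)            # each control unit is used at most once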
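To check that this does what I wanted, here is a small self-contained sketch on the example data. A few things in it are my own assumptions for the demo rather than part of the call above: the 0/1 column `IS_TREATED` (so the treatment coding is explicit), `distance = "mahalanobis"` (to avoid fitting a propensity model on ten rows), and the object names `m`, `matched`, `matched_controls` and `pair_strata`.

```r
library(MatchIt)

# Example data copied from the combined table above.
df <- data.frame(
  ID              = c("T-1", "C-1", "C-2", "C-3", "C-4",
                      "T-2", "C-1", "C-2", "C-5", "C-6"),
  MATCHING_STRATA = rep(c("T-1", "T-2"), each = 5),
  MATCHING_CASE   = rep(c("TREATED", rep("CONTROL", 4)), times = 2),
  COV_A           = c(1.2, 0.6, 0.7, 0.8, 0.6, 1.6, 0.9, 1.4, 0.9, 3.3),
  COV_B           = c(112, 185, 123, 182, 185, 140, 160, 150, 48, 113)
)

# Assumption for this demo: an explicit 0/1 treatment indicator.
df$IS_TREATED <- as.integer(df$MATCHING_CASE == "TREATED")

m <- matchit(IS_TREATED ~ COV_A + COV_B,
             data = df,
             method = "nearest",
             distance = "mahalanobis",   # demo choice, not from the original call
             exact = "MATCHING_STRATA",
             unit.id = "ID",
             replace = FALSE)

# Matched rows, with the pair membership in the `subclass` column.
matched <- match.data(m, data = df)
matched_controls <- matched[matched$IS_TREATED == 0, ]

# No control ID should be used for more than one treated case.
stopifnot(anyDuplicated(matched_controls$ID) == 0)

# Every matched pair should sit inside a single stratum.
pair_strata <- tapply(matched$MATCHING_STRATA, matched$subclass,
                      function(s) length(unique(s)))
stopifnot(all(pair_strata == 1))

print(matched[order(matched$subclass), ])
```

If both checks pass, a control that appears in several strata has been used in at most one of them. On real data it is still worth looking at summary(m) for covariate balance afterwards.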