I am running the coarsened exact matching (CEM) method from the MatchIt package on a dataset that contains ~18,300 rows (i.e., one row for each patient). I am matching patients on two covariates: diagnosis age (which ranges from 0 to 76) and current age (which ranges from 1 to 90).
I do not want to do 1:1 matching; rather, my goal is to minimize data loss by matching as many patients as possible.
My question arises from trying to manage the trade-off between exact and approximate balancing. I want the diagnosis and current ages of matched individuals to vary by no more than 2 years total. (If very rare instances of 3 years' difference are unavoidable, that's OK -- but by and large, I want to keep the difference to no more than two years.) The two years' difference can be two years' difference in diagnosis age, two years' difference in current age, or two total years' difference between diagnosis age and current age. The idea is that I want to try to match individuals' disease duration across these two groups.
I've tried a number of different arguments for the cutpoints parameter. Here is one example:
matchit <- matchit(Group ~ Last_recorded_age + Diagnosis_age,
                   data = df,
                   method = 'cem',
                   cutpoints = list(Last_recorded_age = 44, Diagnosis_age = 38))
This divides current age into 44 bins, so each bin generally spans about 2 years. Diagnosis age is split into 38 bins that each span 2 years. When I run this, all but 383 rows are assigned a subclass. When I use match.data() to view the rows that were assigned a subclass, I can see there are only 25 rows where the diagnosis and current ages vary by 3 years. The rest vary by no more than 2 years. So that's good -- because that's what I want.
But when I look at the 383 rows that were not assigned a subclass, I see that there are cases that were not assigned a subclass that I would have expected would have been assigned a subclass, because they are so similar to cases that were assigned a subclass. For example, one of the subclasses contains a pair of individuals where the treated individual has a current age of 31 and a diagnosis age of 28 and the control individual has a current age of 30 and a diagnosis age of 29. But then I see that there is an unmatched control individual who has a current age of 31 and a diagnosis age of 30. I'm wondering why that person was not assigned to the subclass I just mentioned?
Is there a better way to define the cutpoints so that I match as many individuals as possible, while minimizing variance between the two groups?
The simple reason for the phenomenon with the seemingly close unmatched units is that they do not fall into the same strata as units with the other treatment value. CEM is not concerned with closeness; it only considers bins of the covariate space. Binning is not always the best solution for the reasons you mention: two units could be extremely close but be separated by a bin boundary, placing them into different strata or leaving some unmatched.
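To make the bin-boundary effect concrete, here is a minimal base-R sketch using the three individuals from your example. It assumes that a bin count is turned into equally spaced breaks over the observed range (1 to 90 and 0 to 76, as stated in the question); MatchIt's internal binning may differ in detail, and the column names here are made up for illustration.

```r
# Assumption: a bin count becomes equally spaced breaks over the observed
# range. This mimics, but is not taken from, MatchIt's internals.
current_breaks <- seq(1, 90, length.out = 44 + 1)
diag_breaks    <- seq(0, 76, length.out = 38 + 1)

# The three individuals described in the question (illustrative names).
units <- data.frame(
  label       = c("treated", "matched control", "unmatched control"),
  current_age = c(31, 30, 31),
  diag_age    = c(28, 29, 30)
)

# Assign each unit to a bin on each covariate; a stratum is a bin pair.
units$current_bin <- findInterval(units$current_age, current_breaks)
units$diag_bin    <- findInterval(units$diag_age, diag_breaks)
units$stratum     <- paste(units$current_bin, units$diag_bin, sep = "_")

print(units)
```

Under that assumption, the first two individuals land in the same stratum, while the third sits just across the diagnosis-age boundary at 30 and so ends up in a different one, even though it is only a year or two away from the others.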
An alternative solution would be to use a caliper, which sounds more like what you want anyway. You can use a caliper with full matching to retain as many units as possible. Full matching creates strata that each contain either exactly one treated unit or exactly one control unit, chosen to minimize the total within-stratum distance between treated and control units. A caliper caps the distance allowed between the treated and control units in each stratum, and you can set it on the covariates directly. For example, to ensure that treated and control units within a stratum differ by no more than 2 years in current age and no more than 2 years in diagnosis age, you would use the following code:
matchit <- matchit(Group ~ Last_recorded_age + Diagnosis_age,
data = df, distance = "mahalanobis",
method = "full",
caliper = c(Last_recorded_age = 2, Diagnosis_age = 2),
std.caliper = FALSE)
With so many units, this would normally be slow, but with calipers so tight, it may not take so long.
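Once you have a matched dataset, it is worth checking the treated-versus-control gaps within each stratum yourself. The sketch below is one way to do that in base R. The `md` data frame is a small made-up stand-in for the output of match.data() (which I am assuming carries a `subclass` column alongside your `Group` column), and `max_gap` is a hypothetical helper, not part of MatchIt.

```r
# Toy stand-in for match.data(<your matchit object>); values are invented.
md <- data.frame(
  Group             = c(1, 0, 0, 1, 0),
  Last_recorded_age = c(31, 30, 32, 50, 51),
  Diagnosis_age     = c(28, 29, 30, 40, 42),
  subclass          = factor(c(1, 1, 1, 2, 2))
)

# Hypothetical helper: largest absolute treated-vs-control difference
# in one covariate within one stratum.
max_gap <- function(x, treat) {
  max(abs(outer(x[treat == 1], x[treat == 0], "-")))
}

# One row per stratum with the largest gap on each covariate.
gaps <- do.call(rbind, lapply(split(md, md$subclass), function(s) {
  data.frame(
    subclass    = s$subclass[1],
    current_gap = max_gap(s$Last_recorded_age, s$Group),
    diag_gap    = max_gap(s$Diagnosis_age, s$Group)
  )
}))

print(gaps)
```

Note that this compares treated units with control units only. Two controls that share a stratum with a single treated unit can still be farther apart from each other than the caliper, since the caliper only restricts treated-control distances.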
Otherwise, my only advice for dealing with CEM is to manually adjust the cutpoints until you get a sample you like. Don't forget that you can supply the cutpoints yourself, not just the number of bins.
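As a rough sketch of what supplying the cutpoints yourself could look like: the code below builds 2-year grids with two different origins and shows that shifting the origin changes which ages share a bin. `make_breaks` is a hypothetical helper, and the `toy` data frame is simulated purely so the example runs; substitute your own `df`. The matchit() call only runs if MatchIt is installed.

```r
# Hypothetical helper: equally spaced breaks of a given width, padded so
# the grid covers [lo, hi] whatever the offset.
make_breaks <- function(lo, hi, width = 2, offset = 0) {
  seq(lo - width + offset, hi + width, by = width)
}

# Two candidate grids: one aligned on even ages, one on odd ages.
cuts_a <- list(Last_recorded_age = make_breaks(0, 90),
               Diagnosis_age     = make_breaks(0, 76))
cuts_b <- list(Last_recorded_age = make_breaks(0, 90, offset = 1),
               Diagnosis_age     = make_breaks(0, 76, offset = 1))

# Diagnosis ages 29 and 30 are split by grid A but share a bin in grid B.
bins_a <- findInterval(c(29, 30), cuts_a$Diagnosis_age)
bins_b <- findInterval(c(29, 30), cuts_b$Diagnosis_age)
print(bins_a)
print(bins_b)

if (requireNamespace("MatchIt", quietly = TRUE)) {
  # Simulated data for illustration only; not the data from the question.
  set.seed(1)
  n <- 2000
  diag_age <- sample(0:76, n, replace = TRUE)
  toy <- data.frame(
    Group             = rbinom(n, 1, 0.3),
    Diagnosis_age     = diag_age,
    Last_recorded_age = pmin(90, diag_age + sample(1:14, n, replace = TRUE))
  )
  # Compare how many units each grid retains.
  for (cuts in list(cuts_a, cuts_b)) {
    m <- MatchIt::matchit(Group ~ Last_recorded_age + Diagnosis_age,
                          data = toy, method = "cem", cutpoints = cuts)
    print(summary(m)$nn)
  }
}
```

Comparing the sample-size tables for a few such grids is one way to do the manual tuning described above; which grid retains more units will depend on your data.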