I need an R package or function that will let me match controls to cases in a large dataset of 5 million subjects. I have tried a few packages; my problems are summarized below. So far I have only tried to match on a single covariate, but I will most likely need to match on several.
Package MatchIt: The nearest neighbor, optimal, and genetic methods all just run for hours and hours. The "cem" method runs really quickly, but I need to know which cases were matched/unmatched so I can do further analysis with the matched subset. Running match.data() on the cem results only supplies the weights to be used in a regression and not the matched subset. The paired function in cem would work if I wanted one-to-one matching, but I want to retain as many controls as possible.
matchControls() in the e1071 package: runs for a long time and then returns "not able to allocate vector of size 1352 GB".
Match() function from the Matching package: Just runs and runs...
quickmatch() from the quickmatch package: It ran quickly, but I am not sure I'm using the function correctly or how to extract the matched data from the "qm_matching" object returned. Below is my attempt using quickmatch on fake data.
library(MatchIt)
library(cem)
library(Matching)
library(rgenoud)
library(quickmatch)
# Simulate 1.4 million controls and 600,000 treated subjects
set.seed(100)
control_df <- data.frame(Group = factor("Control"), value = rnorm(1400000, 95, 2))
set.seed(101)
treatment_df <- data.frame(Group = factor("Treatment"), value = c(rnorm(500000, 92, 2), rnorm(100000, 50, 5)))
dat <- rbind(control_df, treatment_df)
# Covariate balance before any matching
covariate_balance(dat$Group, dat$value, matching = NULL,
                  normalize = TRUE, all_differences = TRUE)
# Distance object on the single covariate, then the matching itself
my_distances <- distances(dat, dist_variables = c("value"))
matchedDat <- quickmatch(my_distances, dat$Group)
matchedDat.df <- data.frame(matchedDat)
I am not sure what to do with the returned object. I think quickmatch may be the most viable option. The covariate_balance result shows a decent amount of imbalance between the Control and Treatment groups, so some amount of matching can be done.
Specifically, how do I obtain the matched results, i.e. flag the subjects that were successfully matched between Control and Treatment? The cluster_label column in matchedDat.df implies that the function is creating a large number of clusters. Can I restrict this, and if so, how?
Any help with respect to speeding up some of the functions above or new suggestions would be appreciated.
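To make "flag the subjects that were successfully matched" concrete, here is a minimal base R sketch of coarsened exact matching done by hand on a single covariate. It is not what any of the packages above do internally. The bin width of 1, the small sample sizes, and the names stratum, ok_strata, and matched_dat are assumptions chosen purely for illustration.

```r
# Hand-rolled coarsened exact matching on one covariate (base R only).
# Small fake data with the same shape as the example above.
set.seed(1)
n_control <- 1400
n_treated <- 600
dat <- data.frame(
  Group = factor(rep(c("Control", "Treatment"), c(n_control, n_treated))),
  value = c(rnorm(n_control, 95, 2), rnorm(500, 92, 2), rnorm(100, 50, 5))
)

# Coarsen the covariate into bins; the width of 1 is an arbitrary assumption.
bin_width <- 1
dat$stratum <- floor(dat$value / bin_width)

# Count the members of each group in every stratum.
counts <- table(dat$stratum, dat$Group)

# A stratum is usable only if it holds at least one unit from each group.
ok_strata <- rownames(counts)[counts[, "Control"] > 0 & counts[, "Treatment"] > 0]

# Flag every subject whose stratum is usable.
dat$matched <- as.character(dat$stratum) %in% ok_strata

# Matched subset: all controls sharing a stratum with a treated unit are kept.
matched_dat <- dat[dat$matched, ]
table(dat$Group, dat$matched)
```

Because the work is a single pass of binning and counting, this kind of approach scales roughly linearly with the number of rows, which is consistent with the "cem" method being the only one that finished quickly.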
After a more careful reading of the cem documentation, I think I have the solution to my problem using either the MatchIt package or the cem package.
library(cem)
library(tidyverse)
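set.seed(100)
control_df <- data.frame(Group = factor("Control"), value = rnorm(1400000, 95, 2))
set.seed(101)
treatment_df <- data.frame(Group = factor("Treatment"), value = c(rnorm(500000, 92, 2), rnorm(100000, 50, 5)))
# Keep the row names as an ID column so the cem results can be joined back
dat <- rbind(control_df, treatment_df) %>% rownames_to_column()
# drop = "rowname" excludes the ID column from the matching variables
cem.match <- cem(treatment = "Group", baseline.group = "Control", data = dat, keep.all = TRUE, drop = "rowname")
# Combine group, matched flag, and weights, join back to the data, keep matched rows only
matchedData <- data.frame(Group.check = cem.match$groups, matched = cem.match$matched, weights = cem.match$w) %>%
  rownames_to_column() %>%
  inner_join(dat, by = "rowname") %>%
  filter(matched == TRUE)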
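Since the matched subset keeps unequal numbers of controls and treated subjects per stratum, the weights column matters for any later comparison. The following base R sketch shows how such weights can be computed and used. It assumes the weighting scheme as I understand it from the CEM documentation: treated units get weight 1, and matched controls get (matched controls / matched treated) times (treated in stratum / controls in stratum). The bins of width 1 and every variable name here are illustrative assumptions, not output of cem().

```r
# Stratum weights for a hand-rolled coarsened exact match (base R only).
set.seed(2)
dat <- data.frame(
  Group = rep(c("Control", "Treatment"), c(1400, 600)),
  value = c(rnorm(1400, 95, 2), rnorm(500, 92, 2), rnorm(100, 50, 5))
)
dat$stratum <- floor(dat$value)  # bins of width 1 (assumption)

# Per-stratum group counts, repeated for every row of that stratum.
n_c <- ave(as.numeric(dat$Group == "Control"), dat$stratum, FUN = sum)
n_t <- ave(as.numeric(dat$Group == "Treatment"), dat$stratum, FUN = sum)

# A row is matched when its stratum contains both groups.
dat$matched <- n_c > 0 & n_t > 0
is_t <- dat$matched & dat$Group == "Treatment"
is_c <- dat$matched & dat$Group == "Control"
m_t <- sum(is_t)  # matched treated overall
m_c <- sum(is_c)  # matched controls overall

# Unmatched rows keep weight 0; treated get 1; controls are reweighted.
dat$w <- 0
dat$w[is_t] <- 1
dat$w[is_c] <- (m_c / m_t) * (n_t[is_c] / n_c[is_c])

# Difference in covariate means before and after matching.
raw_diff <- mean(dat$value[dat$Group == "Treatment"]) -
  mean(dat$value[dat$Group == "Control"])
matched_diff <- weighted.mean(dat$value[is_t], dat$w[is_t]) -
  weighted.mean(dat$value[is_c], dat$w[is_c])
c(raw = raw_diff, matched = matched_diff)
```

With this scheme the control weights sum to the number of matched controls, so the weighted control group mirrors the treated group's distribution across strata while still using every matched control.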