I'm working on a process improvement that will use SQL in R to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into R, and then use sqldf() to summarize the data as needed. I'm running out of RAM before I can even begin to use sqldf(), though.
Is there a more efficient way for me to complete this task using R? I've been looking for a way to run a SQL query that joins the separate databases before reading them into R, but so far I haven't found any packages that support this functionality.
Should your data be in a database, dplyr
(part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local or remote database, push your joins, filters, and summaries there, and collect()
the result as a data frame. You will find the process neatly summarized at http://db.rstudio.com/dplyr/
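For illustration, here is a minimal sketch of that workflow against a single Access file over ODBC. The file path, table, and column names are placeholders, and it assumes the dbplyr backend is installed alongside dplyr:

```r
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

# Connect to one of the Access files via ODBC (Windows driver; path is a placeholder)
con <- dbConnect(
  odbc::odbc(),
  .connection_string = paste0(
    "Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
    "Dbq=C:\\data\\source1.accdb;"
  )
)

# Lazy table reference -- no rows are pulled into R yet
sales <- tbl(con, "Sales")

# dplyr translates this to SQL and runs it in Access;
# only the aggregated result comes back as a data frame
by_region <- sales %>%
  group_by(Region) %>%
  summarise(total = sum(Amount, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```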
What I am not quite certain of (it is not an R issue but rather an MS Access issue) is how to access data across multiple MS Access databases.
You may need to write custom SQL for that and pass it to one of the databases via DBI::dbGetQuery(),
letting MS Access handle the link to the other database.
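I have not tested this against Access from R, but one common Jet/ACE SQL idiom for a cross-database join is to qualify the external table with [;database=...] inside the query you pass to DBI::dbGetQuery(). A rough sketch, with all paths, table, and column names as placeholders:

```r
library(DBI)
library(odbc)

con <- dbConnect(
  odbc::odbc(),
  .connection_string = paste0(
    "Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
    "Dbq=C:\\data\\source1.accdb;"
  )
)

# The [;database=...] prefix tells Access/Jet to read Orders from a second
# .accdb file, so the join and aggregation happen before anything reaches R.
sql <- "
  SELECT c.CustomerID, SUM(o.Amount) AS Total
  FROM Customers AS c
  INNER JOIN [;database=C:\\data\\source2.accdb].Orders AS o
    ON c.CustomerID = o.CustomerID
  GROUP BY c.CustomerID
"
result <- dbGetQuery(con, sql)

dbDisconnect(con)
```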