I'm pulling data from a number of different tables and joining them to a base table as shown below:
df <- tbl(ch, dbplyr::in_schema("schem1", "education_data")) %>%
  select(ID, ProviderKey, ReportingPeriodKey, EthnicityKey) %>%
  left_join(
    tbl(ch, dbplyr::in_schema("data_schem", "ReportingPeriodDimensions")) %>%
      select(ReportingPeriodKey, ReportingYear),
    by = "ReportingPeriodKey"
  ) %>%
  left_join(
    tbl(ch, dbplyr::in_schema("data_schem", "ProviderDimensions")) %>%
      select(ProviderKey, ProviderName),
    by = "ProviderKey"
  ) %>%
  left_join(
    tbl(ch, dbplyr::in_schema("sdr", "EthnicityDim")) %>%
      select(EthnicityKey, Ethnicity_description),
    by = "EthnicityKey"
  ) %>%
  filter(ReportingYear %in% c("2015", "2016", "2017"))
When I check the SQL code with show_query(df), there is a SELECT statement over the base table education_data, with all of its variables, for each of the joins performed, which is three times in this example. I often need to join around 8 or 9 other tables, depending on the number of dimension tables required.
This seems very inefficient, especially when I'm pulling tables with records that run into the millions. Is there any workaround for this using R commands rather than coding SQL directly?
I am guessing that instead of seeing tidy SQL queries that look like:
SELECT a.col1, a.col2, b.col3, c.col4
FROM table_A AS a
LEFT JOIN table_B AS b
ON a.col = b.col
LEFT JOIN table_C AS c
ON b.col = c.col
You are instead seeing nested queries that look more like:
SELECT LHS.col1, LHS.col2, LHS.col3, RHS.col4
FROM (
SELECT LHS.col1, LHS.col2, RHS.col3
FROM table_A AS LHS
LEFT JOIN table_B AS RHS
ON LHS.col = RHS.col
) AS LHS
LEFT JOIN table_C AS RHS
ON LHS.col = RHS.col
But with much worse layout.
I agree that the first example is formatted more nicely, easier to read, and better coding practice. In contrast, the SQL code autogenerated by dbplyr is rather ugly.
In general I do not worry about this. Human-written SQL code is intended to be both human and machine readable. Autogenerated SQL code from dbplyr is mainly intended to be machine readable. Your SQL interpreter will most likely execute the two variations in the same way. As the complexity of my work with dbplyr increases, so too does my trust that the SQL interpreter can find an efficient way to execute the query.
I have never noticed major performance differences between autogenerated SQL from dbplyr and equivalent, human-tidied code. But I have not tested this extensively. If you do run such tests, they would be an excellent answer to add to this question.
Some things you can do if you are concerned about performance:
As suggested by @danblack, ensure your tables have indices on the columns you are joining by, and perhaps a clustered index on the most important column. One way to do this without leaving R is DBI::dbExecute(ch, "sql string for adding an index goes here"), which will execute the command you give it against the database.
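For instance, a minimal sketch, assuming ch is an open DBI connection and that your database accepts standard CREATE INDEX syntax (the index name and target column are only illustrative):

# Assumes `ch` is an open DBI connection; index and column names are illustrative
DBI::dbExecute(
  ch,
  "CREATE INDEX idx_education_data_reporting_period
     ON schem1.education_data (ReportingPeriodKey)"
)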
Write out key intermediate tables into the database and index them. For example, if table A is made by joining tables B and C, then a self-join of A with A involves three joins ( (B-C)-(B-C) ) if you work from tables B and C directly, but only one join if you have already written table A to the database ( A-A ).
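One way to do this from R is dbplyr's compute(), which materialises a lazy tbl as a table in the database. A minimal sketch, assuming df is the joined tbl from the question and your database user may create tables (the table name is illustrative):

# Materialise the joined result in the database so later queries run against it
# rather than re-executing all the joins; set temporary = FALSE to keep the table
df_materialised <- df %>%
  compute(name = "education_joined", temporary = TRUE)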
Depending on the context, you could partition each table prior to joining and then append the output tables together. For example, use a loop in R to run one query for each reporting year. Running joins on smaller sets of input data tends to perform faster.
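A minimal sketch of that approach, assuming df is the lazy joined tbl from the question without its year filter, and that a single year's rows fit in memory once collected:

# Run one query per reporting year, then bind the collected results together
years <- c("2015", "2016", "2017")
per_year <- lapply(years, function(yr) {
  df %>%
    filter(ReportingYear == !!yr) %>%  # per-year filter is translated into the SQL
    collect()                          # pull only that year's rows into R
})
df_all <- dplyr::bind_rows(per_year)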