Search code examples
rdplyrdbplyr

dbplyr generating unexpected SQL query


I am sending a simple query to a MySQL database using dbplyr:

library(dbplyr)

my_conn<-dbConnect(...)
tab<-tbl(my_conn, "myTable")

tab %>% select(id, date, type) %>%
    filter(type = 'foobar')

However, if I inspect the generated SQL with show_query(), I get this:

SELECT *
FROM (SELECT `id`, `date`, `type`
FROM `myTable`) `q01`
WHERE (`type` == 'foobar')

This query is very slow to execute.

If I instead execute the following SQL command on the server:

SELECT id, date, type FROM db.myTable WHERE type = 'foobar'

then the return is nearly instantaneous.

My question is: why is dbplyr doing SELECT * (i.e. select all) and then doing a nested select in line 2? Also, why has "q01" appeared in this query? Could this be why my query is very slow to run compared to executing the minimal command directly on the server? I would perhaps expect dbplyr to create inefficient SQL queries as the complexity of the operations increases but--in this case at least--I can't really write a more succinct set of operations. It's just a select and filter (SELECT and WHERE).


Solution

  • dbplyr is generating the SQL query as I would expect. What it has done is one query inside another:

    SELECT id, date, type FROM myTable
    

    Is a subquery in the super query

    SELECT *
    FROM (
       subquery
    ) q01
    WHERE type = foobar
    

    The q01 is the name given to the subquery. In the same way as the AS keyword. For example: FROM very_long_table_name AS VLTN.

    Yes, this nesting is ugly. But many SQL engines have a query optimizer that calculates the best way to execute a query. On SQL Server, I have noticed little difference in performance because the query optimizer finds a faster way to execute than as written.

    However, it appears that for MySQL, nested queries are known to result in slower performance. See here, here, and here.

    One thing that might solve this is changing the order of the select and filter commands in R:

    tab %>%
      filter(type = 'foobar') %>%
      select(id, date, type)
    

    Will probably produce the translated query:

    SELECT `id`, `date`, `type`
    FROM `myTable`
    WHERE (`type` == 'foobar')
    

    Which will perform better.