
dbplyr filter() and lazy evaluation


I have a large MySQL table (92 columns, 3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and have a question or two about dbplyr's lazy evaluation, as I'm experiencing some considerable slowness.

Suppose I have connected to my database and wish to perform a select operation to reduce the number of columns:

results <- my_conn_table %>%
            select(id, type, date)

Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):

> results

# Source:   lazy query [?? x 3]
# Database: mysql []
      id type     date   
   <int> <chr>    <date>    
 1     1 Customer 1997-01-04 
 2     2 Customer 1997-02-08 
 3     3 Business 1997-03-29 
 ...
 

However, if I attempt to perform a filter operation with:

results <- my_conn_table %>%
            select(id, type, date) %>%
            filter(type == "Business")

the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time simply a function of nrow ~= 3 million? (In other words, is there nothing I can do about it because the dataset is large?) Or can someone suggest some general ways to make this more performant?

My initial understanding of lazy evaluation was that the filter() query would only return the top few rows, preventing this long-run-time scenario. If I wanted all the data, I could run collect() to gather the results into my local R session (which I would expect to take a considerable amount of time, depending on the query).
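
For reference, this is the pattern I had in mind; just a sketch, with an arbitrary limit of 10 rows:

# Preview: dbplyr translates head() into a LIMIT clause, so only a
# handful of rows should be fetched from the database
results %>%
  head(10) %>%
  collect()

# Full retrieval: executes the whole query and pulls every matching row
# into the local R session (which I'd expect to be slow for large results)
all_rows <- results %>% collect()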


Solution

  • Building on @NovaEthos's answer, you can call show_query(results) to get the SQL query that dbplyr generated and is passing to the database. Posting this query here will make it clear whether there is any inefficiency in how the database is being queried.
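
    For example, running the following prints the generated SQL without executing it. The output below is only a sketch; the exact SQL and table name depend on your connection and query:

    show_query(results)
    #> <SQL>
    #> SELECT `id`, `type`, `date`
    #> FROM `my_conn_table`
    #> WHERE (`type` = 'Business')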

    A further thing to investigate is how your data is indexed. Like an index in a book, an index on a database table helps find records faster.

    You might only be asking for 1,000 records with type = 'Business', but if those records only start at row 2 million, the database has to scan two-thirds of your data before it finds the first record of interest.

    You can add indexes to existing database tables using something like the following:

    # database_name, schema_name, table_name and column_name should hold
    # the relevant names as strings
    query <- paste0("CREATE NONCLUSTERED INDEX my_index_name ON ",
                    database_name, ".", schema_name, ".", table_name,
                    " (", column_name, ");")

    DBI::dbExecute(db_connection, query)
    

    Note that this syntax is for SQL Server. Your database may require slightly different syntax. In practice, I wrap this in a function with additional checks, such as 'does the indexing column exist in the table?'; this is demonstrated here.
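
    As a rough illustration, here is a minimal sketch of such a wrapper. The helper name add_index and its checks are hypothetical, and it uses a plain CREATE INDEX statement, which is closer to MySQL syntax:

    add_index <- function(db_connection, table_name, column_name,
                          index_name = paste0("idx_", column_name)) {
      # Check that the indexing column actually exists in the table
      if (!column_name %in% DBI::dbListFields(db_connection, table_name)) {
        stop("Column '", column_name, "' not found in table '", table_name, "'")
      }

      # Build and run the CREATE INDEX statement (adjust syntax for your database)
      query <- paste0("CREATE INDEX ", index_name,
                      " ON ", table_name,
                      " (", column_name, ");")
      DBI::dbExecute(db_connection, query)
    }

    # Example usage (placeholder table and column names):
    # add_index(db_connection, "my_table", "type")

    For anything user-facing, it is also safer to quote the identifiers (for example with DBI::dbQuoteIdentifier()) rather than pasting raw strings into the statement.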