Tags: r, rsqlite, dbplyr

Update SQLite database without downloading data locally


I have the following 2 dataframes:

library(tidyverse)
library(RSQLite)

df1 <- data.frame(user_id=c("A","B","C"), 
                  transaction_date=c("2019-01-01","2019-01-01","2019-01-01"))

df2 <- data.frame(user_id=c("C","D","E"), 
                  transaction_date=c("2019-01-03","2019-01-03","2019-01-03"))

df1
df2

#    user_id    transaction_date
#    <fct>      <fct>
#    A          2019-01-01
#    B          2019-01-01
#    C          2019-01-01

#    user_id    transaction_date
#    <fct>      <fct>
#    C          2019-01-03
#    D          2019-01-03
#    E          2019-01-03

I would like to find the minimum transaction date for each user_id. I could do it like so:

rbind(df1, df2) %>%
  group_by(user_id) %>%
  summarise(min_dt = min(as.Date(transaction_date)))

#    user_id    min_dt
#    <fct>      <date>
#    A          2019-01-01
#    B          2019-01-01
#    C          2019-01-01
#    D          2019-01-03
#    E          2019-01-03

The problem is that I have hundreds of dataframes (one per day) and millions of rows per dataframe, and the list of user_ids grows each time I introduce new user_ids and recalculate min_dt, so the whole process becomes slower over time. Questions: 1) Would it be faster to run the computations in SQLite? 2) If so, how can I accomplish this without downloading the data locally each time?

Here's what I've tried.

Step 1: Create database from df1:

db <- dbConnect(SQLite(), dbname = "user_db.sqlite")

dbWriteTable(conn = db, name = "first_appeared", value = df1, append=TRUE)
tbl(db, "first_appeared")

##  Source:   table<first_appeared> [?? x 2]
##  Database: sqlite 3.29.0 [user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>           
# 1 A       2019-01-01      
# 2 B       2019-01-01      
# 3 C       2019-01-01 

Step 2: Append df2:

dbWriteTable(conn = db, name = "first_appeared", value = df2, append=TRUE)
tbl(db, "first_appeared")

##  Source:   table<first_appeared> [?? x 2]
##  Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
#   mapper/user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>           
# 1 A       2019-01-01      
# 2 B       2019-01-01      
# 3 C       2019-01-01      
# 4 C       2019-01-03      
# 5 D       2019-01-03      
# 6 E       2019-01-03 

Step 3: Calculate min_dt in SQLite

tbl(db, "first_appeared") %>%
group_by(user_id) %>%
summarise(first_appeared=min(transaction_date))

dbDisconnect(db)            # Close connection

##  Source:   lazy query [?? x 2]
##  Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
##  mapper/user_db.sqlite]
#   user_id first_appeared
#   <chr>   <chr>         
# 1 A       2019-01-01    
# 2 B       2019-01-01    
# 3 C       2019-01-01    
# 4 D       2019-01-03    
# 5 E       2019-01-03  

Step 4: How do I transfer these results directly to the database (overwrite the database) without first downloading the data locally?


Solution

  • General approach

    Let me begin with the general approach I would use: updating a 'latest' table with each new day's data.

    update = function(existing_table, new_day_table){
      # keep, for each user_id, the earliest transaction_date seen so far
      existing_table %>%
        full_join(new_day_table, by = "user_id", suffix = c("_exist", "_new")) %>%
        mutate(transaction_date = ifelse(test = !is.na(transaction_date_exist)
                                                & (is.na(transaction_date_new)
                                                   | transaction_date_exist < transaction_date_new),
                                         yes = transaction_date_exist,
                                         no = transaction_date_new)) %>%
        select(user_id, transaction_date)
    }
    

    Within R, you would run this function each day:

    existing_table = update(existing_table, next_day_table)
    

    I recommend this approach because at each calculation you only need two tables: the table you are storing all the details in, and the table you are using to update it. This is significantly less data to process than all the daily data files for each update.
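
    For instance, here is a minimal sketch of that daily loop, assuming the daily files are CSVs named like transactions_<date>.csv in a data/ directory (hypothetical names), with one row per user_id per file as in df1 and df2 above:

    files = list.files("data", pattern = "^transactions_.*\\.csv$", full.names = TRUE)

    existing_table = read.csv(files[1])                   # seed with the first day
    for (f in files[-1]) {
      existing_table = update(existing_table, read.csv(f))
    }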

    What if you want to do this within SQLite?

    The code within my update function above should translate easily via dbplyr from R into SQLite, assuming both existing_table and next_day_table are already tables in SQLite, as sketched below.
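
    For example, assuming the day's rows have already been loaded into a staging table called next_day (a hypothetical name) and db is the connection from the question, you can point lazy references at both tables and reuse the same update function; dbplyr translates ifelse() into a SQL CASE WHEN:

    existing_table = tbl(db, "first_appeared")   # lazy reference, no data downloaded
    next_day_table = tbl(db, "next_day")         # hypothetical staging table

    new_table = update(existing_table, next_day_table)   # builds SQL only; nothing has run yet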

    However, dbplyr does not save the resulting table in the database. If you call new_table = update(existing_table, next_day_table), then new_table is a lazy query, defined by the SQL commands that dbplyr uses to construct it.
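
    You can inspect the SQL that defines it with show_query():

    show_query(new_table)   # prints the SELECT statement dbplyr generated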

    To save it as a table you probably need something like the following:

    # render the lazy query to SQL and wrap it in a CREATE TABLE ... AS statement
    sql_query = paste0("CREATE TABLE new_first_appeared AS\n",
                       as.character(sql_render(new_table)))
    dbExecute(db, sql_query)
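
    Alternatively, dbplyr offers compute() to materialize a lazy query as a database table; behaviour differs across versions, so treat this as an option to verify rather than a guarantee:

    compute(new_table, name = "new_first_appeared", temporary = FALSE)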
    

    Note that you have to write the result to new_first_appeared in the database. You cannot directly overwrite first_appeared, because the query that defines new_first_appeared reads from first_appeared. You then have to delete the existing table first_appeared and rename new_first_appeared, for example:
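
    A minimal sketch of that swap using DBI:

    dbExecute(db, "DROP TABLE first_appeared")
    dbExecute(db, "ALTER TABLE new_first_appeared RENAME TO first_appeared")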

    What else could you consider?

    Depending on your context, the existing records in your first_appeared table may never change once created. If that is the case then, rather than rewriting the entire table, you should instead identify just the new records and append them to the existing table.

    For this you might want the INSERT INTO first_appeared SELECT ... pattern within SQLite. You will also need to change your query so that it returns just the new records, as sketched below.
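
    A sketch of that pattern via DBI, assuming the day's rows sit in a staging table called todays_transactions (a hypothetical name). Because existing records never change, only user_ids not already present need to be inserted:

    sql = "
      INSERT INTO first_appeared
      SELECT user_id, MIN(transaction_date) AS transaction_date
      FROM todays_transactions
      WHERE user_id NOT IN (SELECT user_id FROM first_appeared)
      GROUP BY user_id
    "
    dbExecute(db, sql)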