I have the following two dataframes:
library(tidyverse)
library(RSQLite)
df1 <- data.frame(user_id=c("A","B","C"),
transaction_date=c("2019-01-01","2019-01-01","2019-01-01"))
df2 <- data.frame(user_id=c("C","D","E"),
transaction_date=c("2019-01-03","2019-01-03","2019-01-03"))
df1
# user_id transaction_date
# <fct>   <fct>
# A       2019-01-01
# B       2019-01-01
# C       2019-01-01

df2
# user_id transaction_date
# <fct>   <fct>
# C       2019-01-03
# D       2019-01-03
# E       2019-01-03
I would like to find the minimum transaction date for each user_id. I could do it like so:
rbind(df1, df2) %>%
  group_by(user_id) %>%
  summarise(min_dt = min(as.Date(transaction_date)))
# user_id min_dt
# <fct>   <date>
# A       2019-01-01
# B       2019-01-01
# C       2019-01-01
# D       2019-01-03
# E       2019-01-03
The problem is that I have hundreds of dataframes (one per day) with millions of rows each, and the list of user_ids grows each time new user_ids appear and min_dt is recalculated, so the whole process becomes very slow over time. Questions: 1) Would it be faster to run the computations in SQLite? 2) If so, how can I accomplish this without downloading the data locally each time?
Here's what I've tried.
Step 1: Create database from df1:
db <- dbConnect(SQLite(), dbname = "user_db.sqlite")
dbWriteTable(conn = db, name = "first_appeared", value = df1, append = TRUE)
tbl(db, "first_appeared")
# Source:   table<first_appeared> [?? x 2]
# Database: sqlite 3.29.0 [user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
Step 2: Append df2:
dbWriteTable(conn = db, name = "first_appeared", value = df2, append = TRUE)
tbl(db, "first_appeared")
# Source:   table<first_appeared> [?? x 2]
# Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
#   mapper/user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
# 4 C       2019-01-03
# 5 D       2019-01-03
# 6 E       2019-01-03
Step 3: Calculate min_dt in SQLite:
tbl(db, "first_appeared") %>%
  group_by(user_id) %>%
  summarise(first_appeared = min(transaction_date))
# Source:   lazy query [?? x 2]
# Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
#   mapper/user_db.sqlite]
#   user_id first_appeared
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
# 4 D       2019-01-03
# 5 E       2019-01-03

dbDisconnect(db) # Close connection
Step 4: How do I transfer these results directly to the database (overwrite the database) without first downloading the data locally?
Let me begin with the general approach I would use: updating a 'latest' table with each new day.
update <- function(existing_table, new_day_table){
  existing_table %>%
    # keep every user_id that appears in either table
    full_join(new_day_table, by = "user_id", suffix = c("_exist", "_new")) %>%
    # take the existing date when it exists and is earlier (or the new date
    # is missing); otherwise take the new date
    mutate(transaction_date = ifelse(test = !is.na(transaction_date_exist)
                                     & (is.na(transaction_date_new)
                                        | transaction_date_exist < transaction_date_new),
                                     yes = transaction_date_exist,
                                     no = transaction_date_new)) %>%
    select(user_id, transaction_date)
}
Within R, you would run this function each day:
existing_table = update(existing_table, next_day_table)
I recommend this approach because each update only needs two tables: the table that accumulates the results, and the table of the new day's data. That is significantly less data to process than re-reading every daily file for each update.
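For example, here is the function applied locally to the two sample dataframes from the question (a minimal sketch; it assumes the transaction_date columns are character, as they are from R 4.0 onwards — with the factor columns shown in the question, convert via as.character() first so the date comparison behaves):
existing_table <- df1                          # seed with the first day
existing_table <- update(existing_table, df2)  # fold in the next day
existing_table
#   user_id transaction_date
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
# 4 D       2019-01-03
# 5 E       2019-01-03
Note that comparing ISO-format date strings with < works because they sort lexicographically in date order. The same pipeline runs lazily when the inputs are dbplyr table references instead of local dataframes.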
The code within my update function above should translate easily via dbplyr from R into SQLite, assuming both existing_table and next_day_table are already in SQLite. However, dbplyr does not save the resulting table as an object. So if you call new_table = update(existing_table, next_day_table), then new_table will be defined by the SQL commands that dbplyr uses to construct it.
To save it as a table you probably need something like the following:
sql_query <- paste("CREATE TABLE new_first_appeared AS",
                   as.character(sql_render(new_table)),
                   sep = "\n")
dbExecute(db_connection, sql_query)
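Alternatively, newer versions of dbplyr can materialise a lazy query for you with compute(); a sketch, assuming a reasonably recent dbplyr (exact behaviour varies by version):
# write the lazy query out as a permanent table on the database side
new_table %>% compute(name = "new_first_appeared", temporary = FALSE)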
Note that you have to write to new_first_appeared in the database. You cannot directly overwrite first_appeared, as the definition of new_first_appeared depends on first_appeared. You then have to delete the existing table first_appeared and rename new_first_appeared.
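A sketch of that swap, using plain SQL through DBI (assuming the same db_connection object as above):
dbExecute(db_connection, "DROP TABLE first_appeared")
dbExecute(db_connection, "ALTER TABLE new_first_appeared RENAME TO first_appeared")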
Depending on your context, the existing records in your first_appeared table may never change once created. If that is the case, then rather than rewriting the entire table, you instead want to identify just the new records and append them to the existing table. For this you might want the INSERT INTO first_appeared_table SELECT * FROM table pattern within SQLite. You will also need to change your query to return just the new records.
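For example, something like the following (a hypothetical sketch: staging_day is an assumed name for a table holding the new day's rows, loaded via dbWriteTable as in the question):
# append only the user_ids that do not yet appear in first_appeared
dbExecute(db_connection, "
  INSERT INTO first_appeared (user_id, transaction_date)
  SELECT s.user_id, MIN(s.transaction_date)
  FROM staging_day AS s
  LEFT JOIN first_appeared AS f ON s.user_id = f.user_id
  WHERE f.user_id IS NULL
  GROUP BY s.user_id
")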