Tags: mysql, r, dplyr, rmysql

Connection lost when connect R to MySQL using collect


I would like to use dplyr and RMySQL to work with my large data. There are no issues with the dplyr code itself. The problem (I think) is with exporting the data out of MySQL into R. My connection is dropped every time, even when I use n = Inf in collect(). My data should have more than 50K rows, but I only get around 15K back. Any suggestions are appreciated.

Approach 1

library(dplyr)
library(RMySQL)

# Connect to a database and select a table 
my_db <- src_mysql(dbname='aermod_1', host = "localhost", user = "root", password = "")   
my_tbl <- tbl(my_db, "db_table") 
out_summary_station_raw <- select(my_tbl, -c(X, Y, AVERAGE_CONC))
out_station_mean_local <- collect(out_summary_station_raw)
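
One thing worth trying is bypassing collect() and pulling the rows down in explicit chunks with DBI, so a single long-running fetch is less likely to hit a server-side timeout. A minimal sketch, assuming the same connection details as above and the aermod_final table (the chunk size of 5,000 is arbitrary):

```r
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), dbname = "aermod_1",
                 host = "localhost", user = "root", password = "")

res <- dbSendQuery(con, "SELECT * FROM aermod_final")
chunks <- list()
while (!dbHasCompleted(res)) {
  # Fetch 5,000 rows at a time instead of the whole result at once
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 5000)
}
dbClearResult(res)

out_station_mean_local <- do.call(rbind, chunks)
dbDisconnect(con)
```

If the connection still drops mid-loop, that points at the server or driver rather than at collect().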

Approach 2: using Pool

library(pool)
library(RMySQL)
library(dplyr)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "aermod_1",
  host = "localhost",
  username = "root",
  password = ""
)

out_summary_station_raw <- src_pool(pool) %>% tbl("aermod_final") %>% select(-c(X, Y, AVERAGE_CONC))

out_station_mean_local <- collect(out_summary_station_raw, n = Inf)

Warning message (both approaches):

Warning messages:
1: In dbFetch(res, n) : error while fetching rows
2: Only first 15,549 results retrieved. Use n = Inf to retrieve all. 

Update:

Checked the log and it looks fine from the server side. For my example, the slow log said Query_time: 79.348351 Lock_time: 0.000000 Rows_sent: 15552 Rows_examined: 16449696, but collect just could not retrieve the full data. I am able to run the same query successfully in MySQL Workbench.
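
Since the query runs for ~79 seconds, it may also be worth checking the server's timeout variables from the same session; a net_read_timeout or net_write_timeout shorter than the query time can drop the connection mid-fetch. A quick check (assuming a DBI connection con as in the approaches above):

```r
library(DBI)

# List all timeout-related server variables for inspection
dbGetQuery(con, "SHOW VARIABLES LIKE '%timeout%'")
```

If net_write_timeout (the server side of a long client fetch) looks low, raising it for the session before collecting is a cheap experiment.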


Solution

  • After the most recent RMySQL update, I noticed that I could not collect() data from large views, and I reported it as an issue. Your problem might be related.

    One thing to try is to roll back to the previous version:

    devtools::install_version("RMySQL", version = "0.10.9", 
                              repos = "http://cran.us.r-project.org")
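
    If rolling back does not help, another option is the RMariaDB driver, which is the maintained successor to RMySQL and also works with dplyr/dbplyr. A sketch, assuming the same database and the aermod_final table from the question:

    ```r
    library(dplyr)
    library(RMariaDB)

    con <- dbConnect(RMariaDB::MariaDB(), dbname = "aermod_1",
                     host = "localhost", user = "root", password = "")

    # tbl() works on a plain DBI connection via dbplyr
    out_station_mean_local <- tbl(con, "aermod_final") %>%
      select(-c(X, Y, AVERAGE_CONC)) %>%
      collect()

    dbDisconnect(con)
    ```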