
Import MySQL dump into R (without requiring MySQL server)


Packages like RMySQL and sqldf allow one to interface with local or remote database servers. I'm creating a portable project which involves importing SQL data in cases (or on devices) that do not always have access to a running server, but that do always have access to the latest .sql dump of the database.

The goal seems simple enough: import an .sql dump into R without the involvement of a MySQL server. More specifically, I'd like to create a list of lists in which the elements correspond to any databases defined in the .sql dump (there may be multiple), and those elements in turn consist of the tables in those databases.

To make this reproducible, let's take the sample sportsdb SQL file here; once unzipped, it's called sportsdb_sample_mysql_20080303.sql.

One would think sqldf might be able to do it:

read.csv.sql('sportsdb_sample_mysql_20080303.sql', sql="SELECT * FROM addresses")
Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: no such table: addresses

This despite the fact that there certainly is an addresses table in the dump. This post on the sqldf list mentions the same error, but no solution.

Then there is an sql.reader function in the ProjectTemplate package, which looks promising. Poking around, I found the source for the function here; it assumes a running database server and relies on RMySQL, which is not what I need.

So... we seem to be running out of options. Any help from the hivemind appreciated!

(To reiterate, I am not looking for a solution that relies on access to an SQL server; that's easy with dbReadTable from the RMySQL package. I would very much like to bypass the server and get the data straight from the .sql dump file.)


Solution

  • Depending on what you want to extract from the tables, here is how you can play around with the data:

    #how many lines are in the dump
    numLines <- R.utils::countLines("sportsdb_sample_mysql_20080303.sql")
    # [1] 81266
    
    #peek at the first 60 lines of the dump
    linesInDB <- readLines("sportsdb_sample_mysql_20080303.sql", n=60)
    

    Then you can do some regex to get table names (after CREATE TABLE), column names (between the first pair of brackets) and VALUES (the lines after CREATE TABLE, between the second pair of brackets); a rough sketch follows the reference below.

    Reference: Reverse engineering a mysqldump output with MySQL Workbench gives "statement starting from pointed line contains non UTF8 characters" error
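
    A rough sketch of that idea (my addition, not part of the original answer), assuming the same dump file as above and that table names may or may not be wrapped in backticks:

    library(stringi)
    
    allLines <- readLines("sportsdb_sample_mysql_20080303.sql")
    #lines that define a table
    createLines <- allLines[stri_detect_regex(allLines, "^\\s*CREATE TABLE")]
    #first capture group = the table name, with optional backticks stripped
    tableNames <- stri_match_first_regex(createLines, "CREATE TABLE\\s+`?([A-Za-z0-9_]+)`?")[, 2]
    head(tableNames)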


    EDIT: in response to OP's answer, if I interpret the Python script correctly, it also reads the file line by line, filters for INSERT INTO lines, parses them as CSV and then writes them to file. That is very similar to my original suggestion. My version in R is below. If the file size is too large, it would be better to read the file in chunks using some other R package.

    options(stringsAsFactors=F)
    library(utils)
    library(stringi)
    library(plyr)
    
    mysqldumpfile <- "sportsdb_sample_mysql_20080303.sql"
    
    #keep only the INSERT INTO lines of the dump
    allLines <- readLines(mysqldumpfile)
    insertLines <- allLines[which(stri_detect_fixed(allLines, "INSERT INTO"))]
    #split each INSERT line into words; simplify=NA returns a character matrix padded with NA
    allwords <- data.frame(stri_extract_all_words(insertLines, simplify=NA))
    #the third word of each line is the table name, so process the lines one table at a time
    d_ply(allwords, .(X3), function(x) {
        #x <- split(allwords, allwords$X3)[["baseball_offensive_stats"]]
        print(x[1,3])
    
        #find where the header/data columns start and end
        valuesCol <- which(x[1,]=="VALUES")
        lastCols <- which(apply(x, 2, function(y) all(is.na(y))))
        datLastCol <- head(c(lastCols, ncol(x)+1), 1) - 1
    
        #format and prepare for write to file
        #words after VALUES are the data; words between the table name and VALUES are the column names
        df <- data.frame(x[,(valuesCol+1):datLastCol])
        df <- setNames(df, x[1,4:(valuesCol-1)])
        #type convert per column before writing to file otherwise its all strings
        df[] <- lapply(df, type.convert, as.is=TRUE)
        #write one csv per table
        write.csv(df, paste0(x[1,3],".csv"), row.names=F)
    })
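
    If you then want something like the list-of-data-frames structure described in the question, one rough way (my addition, not part of the original answer) is to read the per-table CSVs written above back in, keyed by table name; handling multiple databases in one dump would additionally require tracking the USE statements:

    #assumes the csv files written by the loop above are in the working directory
    csvFiles <- list.files(pattern="\\.csv$")
    sportsdb <- lapply(csvFiles, read.csv)
    names(sportsdb) <- sub("\\.csv$", "", csvFiles)
    str(sportsdb[["addresses"]])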