Search code examples
mysqlrdatabasedplyrdbplyr

How to manipulate/clean data located in a MySQL database using base R commands?


I've connected to a MySQL database using the RMariaDB package, and, thanks to the dbplyr package, am able to adjust the data using dplyr commands directly from R studio. However, there are some basic things I want to do that require base R functions (there are no equivalents in dplyr to my knowledge). Is there a way to clean this data using base R commands? Thanks in advance.


Solution

  • The answer to this arises from how the dbplyr package works. dbplyr translates certain dplyr commands into SQL. For example:

    library(dplyr)
    library(dbplyr)
    data(mtcars)
    
    # setup simulated database connection
    df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
    
    # fetch first 5 records
    first_five = df_postgre %>% head(5)
    
    # view SQL transaltion
    first_five %>% show_query()
    
    # resulting SQL translation
    <SQL>
    SELECT *
    FROM `df`
    LIMIT 5
    

    The major constrain for this approach is that dbplyr can only translate certain commands into SQL. So something like the following will fail:

    # setup simulated database connection
    df_postgre = tbl_lazy(mtcars, con = simulate_postgres())
    
    # fetch first 5 records
    first_five = df_postgre[1:5,]
    

    While head(df, 5) and df[1:5,] produce identical output for data.frames in local R memory, dbplyr can not translate developer intention, only specific dplyr commands. Hence these two commands are very different when working with database tables.

    The other element to consider here is that databases are primarily read-only. In R we can do

    df = df %>%
      mutate(new_var = 2*old_var)
    

    and this changes the data held in memory. However in databases, the original data is stored in the database and it is transformed based on your instructions when it is requested. There are ways to write completely new database tables from existing database tables - there are already several Q&A on this under the dbplyr tag.