Search code examples
rrsqlite

error rsqlite_fetch in R


I have a "CSV" format dataset with 28 variables and 7400 observation. I import the dataset in R with the name of "films" and I want to clean that. In this way, I use "sqldf" and "SQLite" libraries. But when I use "sqldf" I receive the warnings and consequently my dataset does not update too:

enter image description here

color   director_name   num_critic_for_reviews  duration
Color   James Cameron   723                       178
Color   Gore Verbinski  302                       169
Color   Sam Mendes      602                       148

the code I have used is as below:

library(RSQLite) 
library(sqldf)
db <- dbConnect(SQLite(), "tempdb")  

dbWriteTable(db,"films",films, overwrite=TRUE)

d <- sqldf(c('update films set movie_title=lower(movie_title)', 'select * from films'))
d <- sqldf(c('update films set actor_3_name=lower(actor_3_name)', 'select * from films'))
d <- sqldf(c('update films set actor_2_name=lower(actor_2_name)', 'select * from films'))
d <- sqldf(c('update films set actor_1_name=lower(actor_1_name)', 'select * from films'))
d <- sqldf(c('update films set director_name=lower(director_name)', 'select * from films'))


# Warning message:
# In rsqlite_fetch(res@ptr, n = n) :
#   Don't need to call dbFetch() for statements, only for queries

What is the problem?


Solution

  • There are several problems at work here.

    First:

    # Warning message:
    # In rsqlite_fetch(res@ptr, n = n) :
    #   Don't need to call dbFetch() for statements, only for queries
    

    This is a warning, not an error. In fact, it appears to be somewhat new issue in RSQLite, and is referenced in an open issue: https://github.com/r-dbi/RSQLite/issues/227.

    (For cleanness, I'm removing it from the output in the rest of this answer, though it occurs with each non-select query.)

    Second, RSQLite deals with a database, period. It has no opinion or awareness of variables within the R environment, so there is no suggestion that a variable in R is an immediate and perpetual representation of a database table. (There are similar methodologies using dbplyr that do do this, roughly speaking.)

    In order to provide this type of tie between R and some form of SQL querying, there's sqldf that allows you to query against R variables as if they were actual SQL tables. When you do such a query, it grabs the data.frame as it currently looks, inserts it into a temporary database table (whether RSQLite or another), runs the SQL code, then returns what you need.

    Third: despite this apparent tie, it is quite functional in that it does not produce side-effects within the R environment. That means that if you want to store the resulting data in a way that R can use, you need to explicitly capture the new table into an R variable.

    For example:

    library(sqldf)
    (mt <- mtcars[1:5,1:5])
    #                    mpg cyl disp  hp drat
    # Mazda RX4         21.0   6  160 110 3.90
    # Mazda RX4 Wag     21.0   6  160 110 3.90
    # Datsun 710        22.8   4  108  93 3.85
    # Hornet 4 Drive    21.4   6  258 110 3.08
    # Hornet Sportabout 18.7   8  360 175 3.15
    

    After the update, the original data is untouched.

    sqldf('update mt set cyl=5 where cyl>5')
    mt
    #                    mpg cyl disp  hp drat
    # Mazda RX4         21.0   6  160 110 3.90
    # Mazda RX4 Wag     21.0   6  160 110 3.90
    # Datsun 710        22.8   4  108  93 3.85
    # Hornet 4 Drive    21.4   6  258 110 3.08
    # Hornet Sportabout 18.7   8  360 175 3.15
    

    You can get the data individually or in the same line by including a select * from ... in your call to sqldf:

    mt2 <- sqldf(c('update mt set cyl=5 where cyl>5', 'select * from mt'))
    mt2
    #    mpg cyl disp  hp drat
    # 1 21.0   5  160 110 3.90
    # 2 21.0   5  160 110 3.90
    # 3 22.8   4  108  93 3.85
    # 4 21.4   5  258 110 3.08
    # 5 18.7   5  360 175 3.15
    

    (In this case I saved it to mt2, but you could just have easily over-written it.)

    All of this is discussed in various forms in the sqldf FAQ 8, "8. Why am I having problems with update?"

    EDIT

    There appear to be several misunderstandings about sqldf and such.

    1. You are creating a db handle for direct SQL queries (not sqldf), but you never use it. As you'll see later, either (a) use dbExecute (and related functions) with the db handle, or (b) use sqldf, no need for dbConnect and friends.

    2. When using sqldf, on each and every call to sqldf it does a complete copy of the current instance of the variable into a database. (This is both helpful and, at times, inefficient. With smaller datasets, the time lost is likely not felt, but still ...) So when you keep referring to the table films, it is ignoring the d you created, because it has no way to infer what you are trying to do outside of its call ... it just copies, queries, and discards.

      # assuming this is something like what you do ... but it doesn't matter
      films <- read.csv("films.csv", ...)
      #    `-<---<---<---<---<---<---<---<---<---<---<---<---<---<---<-+-<--.
      db <- dbConnect(SQLite(), "tempdb") # not used in sqldf          ^     \
      dbWriteTable(db, "films", films, overwrite=TRUE) # never used    ^      \
      #                             `--- is referring to --->--->--->--'       \
      d <- sqldf(c('update films set movie_title=lower(movie_title)', #         \
                                   'select * from films')) #                     \
      #                        \                      `--- (internal to sqldf)    ^
      #                         `--- refers to the original 'films' --->--->--->--'
      

      Option 1, use the RSQLite functions, not sqldf:

      db <- dbConnect(SQLite(), "tempdb")
      dbWriteTable(db,"films",films, overwrite=TRUE)
      dbExecute(db, 'update films set actor_3_name=lower(actor_3_name)')
      #        `--- repeat for all updates
      films <- dbGetQuery(db, 'select * from films')
      

      Option 2, (not my preferred) use the variable created on the previous line:

      films <- read.csv("films.csv", ...)
      #   `--<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<-.
      d <- sqldf(c('update films set movie_title=lower(movie_title)', #         \
                                   'select * from films')) #                     \
      #\                        \                      `--- (internal to sqldf)    ^
      # \                        `--- refers to original 'films' --->--->--->--->--'
      #  `--<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<--.
      d <- sqldf(c('update d set actor_3_name=lower(actor_3_name)', #       \
                                   'select * from d')) #                     \
      #                    \                      `--- (internal to sqldf)    ^
      #                     `--- refers to previously-created 'd' --->--->--->'
      #                         (repeat for other updates)
      

      Option 3, always reference/overwrite the original films variable:

      films <- read.csv("films.csv", ...)
      #   `--<---<---<---<---<---<---<---<---<---<---<---+--<---<---<---<---<---.
      films <- sqldf(c('update films set movie_title=lower(movie_title)', #      \
                                    'select * from films')) #                     \
      #   \                        \                   `--- (internal to sqldf)    ^
      #    \                        ` --- refers to the first 'films' -->--->--->--'
      #     `-<---<---<---<---<---<---<---<---<---<---<---+--<---<---<---<---<--.
      films <- sqldf(c('update films set actor_3_name=lower(actor_3_name)', #    \
                                    'select * from films')) #                     \
      #                            \                   `--- (internal to sqldf)    ^
      #                             ` --- refers to the second 'films' -->--->--->-'
      #                              (repeat for other updates)
      
    3. sqldf inefficiencies. Each time you make a call to sqldf, it copies the entire dataset into a temporary table. Every. Time. You can reduce some of the overhead by combining all query strings into a single call, like so:

      films <- read.csv("films.csv", ...)
      films <- sqldf(c('update films set actor_3_name=lower(actor_3_name)',
                       'update films set actor_2_name=lower(actor_2_name)',
                       'update films set actor_1_name=lower(actor_1_name)',
                       'update films set director_name=lower(director_name)',
                       'select * from films'))
      
    4. SQL inefficiencies. Your original code may be simplified for the question (which is fine), but if not then here goes. Since you do not appear to be conditioning your updates at all, you can combine the data-cleaning into one update. (This can be used with dbExecute as well.)

      films <- read.csv("films.csv", ...)
      films <- sqldf(c('update films set actor_3_name=lower(actor_3_name),
                                         actor_3_name=lower(actor_3_name),
                                         actor_2_name=lower(actor_2_name),
                                         actor_1_name=lower(actor_1_name),
                                         director_name=lower(director_name)',
                       'select * from films'))
      
    5. Do you really need SQL? This can be done quite easily/quickly in R:

      films <- read.csv("films.csv", ...)
      films <- within(films, {
        actor_3_name <- tolower(actor_3_name)
        actor_2_name <- tolower(actor_2_name)
        actor_1_name <- tolower(actor_1_name)
        director_name <- tolower(director_name)
      })