I have a "CSV" format dataset with 28 variables and 7400 observation. I import the dataset in R with the name of "films" and I want to clean that. In this way, I use "sqldf" and "SQLite" libraries. But when I use "sqldf" I receive the warnings and consequently my dataset does not update too:
color director_name num_critic_for_reviews duration
Color James Cameron 723 178
Color Gore Verbinski 302 169
Color Sam Mendes 602 148
the code I have used is as below:
library(RSQLite)
library(sqldf)
db <- dbConnect(SQLite(), "tempdb")
dbWriteTable(db,"films",films, overwrite=TRUE)
d <- sqldf(c('update films set movie_title=lower(movie_title)', 'select * from films'))
d <- sqldf(c('update films set actor_3_name=lower(actor_3_name)', 'select * from films'))
d <- sqldf(c('update films set actor_2_name=lower(actor_2_name)', 'select * from films'))
d <- sqldf(c('update films set actor_1_name=lower(actor_1_name)', 'select * from films'))
d <- sqldf(c('update films set director_name=lower(director_name)', 'select * from films'))
# Warning message:
# In rsqlite_fetch(res@ptr, n = n) :
# Don't need to call dbFetch() for statements, only for queries
What is the problem?
There are several problems at work here.
First:
# Warning message:
# In rsqlite_fetch(res@ptr, n = n) :
# Don't need to call dbFetch() for statements, only for queries
This is a warning, not an error. In fact, it appears to be somewhat new issue in RSQLite
, and is referenced in an open issue: https://github.com/r-dbi/RSQLite/issues/227.
(For cleanness, I'm removing it from the output in the rest of this answer, though it occurs with each non-select
query.)
Second, RSQLite
deals with a database, period. It has no opinion or awareness of variables within the R environment, so there is no suggestion that a variable in R is an immediate and perpetual representation of a database table. (There are similar methodologies using dbplyr
that do do this, roughly speaking.)
In order to provide this type of tie between R and some form of SQL querying, there's sqldf
that allows you to query against R variables as if they were actual SQL tables. When you do such a query, it grabs the data.frame as it currently looks, inserts it into a temporary database table (whether RSQLite
or another), runs the SQL code, then returns what you need.
Third: despite this apparent tie, it is quite functional in that it does not produce side-effects within the R environment. That means that if you want to store the resulting data in a way that R can use, you need to explicitly capture the new table into an R variable.
For example:
library(sqldf)
(mt <- mtcars[1:5,1:5])
# mpg cyl disp hp drat
# Mazda RX4 21.0 6 160 110 3.90
# Mazda RX4 Wag 21.0 6 160 110 3.90
# Datsun 710 22.8 4 108 93 3.85
# Hornet 4 Drive 21.4 6 258 110 3.08
# Hornet Sportabout 18.7 8 360 175 3.15
After the update, the original data is untouched.
sqldf('update mt set cyl=5 where cyl>5')
mt
# mpg cyl disp hp drat
# Mazda RX4 21.0 6 160 110 3.90
# Mazda RX4 Wag 21.0 6 160 110 3.90
# Datsun 710 22.8 4 108 93 3.85
# Hornet 4 Drive 21.4 6 258 110 3.08
# Hornet Sportabout 18.7 8 360 175 3.15
You can get the data individually or in the same line by including a select * from ...
in your call to sqldf
:
mt2 <- sqldf(c('update mt set cyl=5 where cyl>5', 'select * from mt'))
mt2
# mpg cyl disp hp drat
# 1 21.0 5 160 110 3.90
# 2 21.0 5 160 110 3.90
# 3 22.8 4 108 93 3.85
# 4 21.4 5 258 110 3.08
# 5 18.7 5 360 175 3.15
(In this case I saved it to mt2
, but you could just have easily over-written it.)
All of this is discussed in various forms in the sqldf
FAQ 8, "8. Why am I having problems with update?"
There appear to be several misunderstandings about sqldf
and such.
You are creating a db
handle for direct SQL queries (not sqldf
), but you never use it. As you'll see later, either (a) use dbExecute
(and related functions) with the db
handle, or (b) use sqldf
, no need for dbConnect
and friends.
When using sqldf
, on each and every call to sqldf
it does a complete copy of the current instance of the variable into a database. (This is both helpful and, at times, inefficient. With smaller datasets, the time lost is likely not felt, but still ...) So when you keep referring to the table films
, it is ignoring the d
you created, because it has no way to infer what you are trying to do outside of its call ... it just copies, queries, and discards.
# assuming this is something like what you do ... but it doesn't matter
films <- read.csv("films.csv", ...)
# `-<---<---<---<---<---<---<---<---<---<---<---<---<---<---<-+-<--.
db <- dbConnect(SQLite(), "tempdb") # not used in sqldf ^ \
dbWriteTable(db, "films", films, overwrite=TRUE) # never used ^ \
# `--- is referring to --->--->--->--' \
d <- sqldf(c('update films set movie_title=lower(movie_title)', # \
'select * from films')) # \
# \ `--- (internal to sqldf) ^
# `--- refers to the original 'films' --->--->--->--'
Option 1, use the RSQLite
functions, not sqldf
:
db <- dbConnect(SQLite(), "tempdb")
dbWriteTable(db,"films",films, overwrite=TRUE)
dbExecute(db, 'update films set actor_3_name=lower(actor_3_name)')
# `--- repeat for all updates
films <- dbGetQuery(db, 'select * from films')
Option 2, (not my preferred) use the variable created on the previous line:
films <- read.csv("films.csv", ...)
# `--<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<-.
d <- sqldf(c('update films set movie_title=lower(movie_title)', # \
'select * from films')) # \
#\ \ `--- (internal to sqldf) ^
# \ `--- refers to original 'films' --->--->--->--->--'
# `--<---<---<---<---<---<---<---<---<---<---<---<---<---<---<---<--.
d <- sqldf(c('update d set actor_3_name=lower(actor_3_name)', # \
'select * from d')) # \
# \ `--- (internal to sqldf) ^
# `--- refers to previously-created 'd' --->--->--->'
# (repeat for other updates)
Option 3, always reference/overwrite the original films
variable:
films <- read.csv("films.csv", ...)
# `--<---<---<---<---<---<---<---<---<---<---<---+--<---<---<---<---<---.
films <- sqldf(c('update films set movie_title=lower(movie_title)', # \
'select * from films')) # \
# \ \ `--- (internal to sqldf) ^
# \ ` --- refers to the first 'films' -->--->--->--'
# `-<---<---<---<---<---<---<---<---<---<---<---+--<---<---<---<---<--.
films <- sqldf(c('update films set actor_3_name=lower(actor_3_name)', # \
'select * from films')) # \
# \ `--- (internal to sqldf) ^
# ` --- refers to the second 'films' -->--->--->-'
# (repeat for other updates)
sqldf
inefficiencies. Each time you make a call to sqldf
, it copies the entire dataset into a temporary table. Every. Time. You can reduce some of the overhead by combining all query strings into a single call, like so:
films <- read.csv("films.csv", ...)
films <- sqldf(c('update films set actor_3_name=lower(actor_3_name)',
'update films set actor_2_name=lower(actor_2_name)',
'update films set actor_1_name=lower(actor_1_name)',
'update films set director_name=lower(director_name)',
'select * from films'))
SQL inefficiencies. Your original code may be simplified for the question (which is fine), but if not then here goes. Since you do not appear to be conditioning your updates at all, you can combine the data-cleaning into one update. (This can be used with dbExecute
as well.)
films <- read.csv("films.csv", ...)
films <- sqldf(c('update films set actor_3_name=lower(actor_3_name),
actor_3_name=lower(actor_3_name),
actor_2_name=lower(actor_2_name),
actor_1_name=lower(actor_1_name),
director_name=lower(director_name)',
'select * from films'))
Do you really need SQL? This can be done quite easily/quickly in R:
films <- read.csv("films.csv", ...)
films <- within(films, {
actor_3_name <- tolower(actor_3_name)
actor_2_name <- tolower(actor_2_name)
actor_1_name <- tolower(actor_1_name)
director_name <- tolower(director_name)
})