Remove Duplicate Columns via sqlQuery()

I am working with the R programming language. Suppose I have the following data frame:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

head(testframe)

  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F

If I want to remove columns with different names but identical values, I can use the following line of code:

no_dup = testframe[!duplicated(as.list(testframe))]

 head(no_dup)
  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F

My Question: Suppose the data frame is not located in the global environment - is it possible to pass the above line of code through a sqlQuery() command? For example:

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

#not sure if this is correct?
sample_query = sqlQuery(con, "testframe[!duplicated(as.list(testframe))]")

Can someone please show me how to do this?

Thanks!

Solution

This does all the substantive processing on the SQL side and only does name manipulation on the R side. The database is not downloaded to R.

The first pipeline inputs the names (we have hard coded the names in Names but you can retrieve them from the database if necessary) and returns an SQL statement, sql1, that when run against your database will produce a one line data frame from the data base that has a column for each pair of variables in testframe and whose value is the number of unequal values.

We then run sql1 using sqldf for reproducibility but you can replace that with an appropriate call to sqlQuery.

The second pipeline then uses numDF to generate one or more SQL statements in character vector sql2 to drop the duplicated columns, i.e. those for which there are zero unequal values You can then run those SQL statements against your database.

We used sqldf with SQLite for reproducibility but you can replace the calls to sqldf with an appropriately modified call to sqlQuery, e.g. sqlQuery(con, sql1) where con is the connection you have previously defined.

It is likely that whatever database system you are using accepts the same SQL but if not there may be small changes needed in the code to generate SQL accepted by whatever it is you are using.

library(magrittr)
library(sqldf)

Names <- c("age", "height", "height2", "gender", "gender2")

sql1 <- Names %>%
  { toString(sprintf("sum(%s)", combn(., 2, paste, collapse = "!="))) } %>%
  paste("select", ., "from testframe")

numDF <- sqldf(sq11)  # replace with call to your database

sql2 <- numDF %>%
  Filter(Negate(c), .) %>%
  names %>%
  sub(".*!=(.*.).", "alter table testframe drop \\1", .)

# Just run the sql2 part against your db, not select * ... part.
# The select * ... downloads table for demo purposes only.
sqldf(c(sql2, "select * from testframe"))  # replace 
##    age height gender
## 1   18   76.1      M
## 2   19   77.0      F
## 3   20   78.1      M
## 4   21   78.2      M
## ...snip...

Note that sql1 and sql2 are the following. sql1 is a single sql select statement and sql2 is a vector of sql alter statements, one statement per column to drop. If your data base allows ALTER to drop multiple columns at once you might be able to simplify that but SQLite only allows one at a time.

sql1
## [1] "select sum(age!=height), sum(age!=height2), sum(age!=gender), sum(age!=gender2), sum(height!=height2), sum(height!=gender), sum(height!=gender2), sum(height2!=gender), sum(height2!=gender2), sum(gender!=gender2) from testframe"

sql2
## [1] "alter table testframe drop height2" "alter table testframe drop gender2"