I am looking for a way to work with connected databases using R so I don't have to load whole databases into memory. I have been using the pool and DBI packages to connect to the database and dplyr for data manipulation, but I have run into some problems I haven't been able to solve:
Loading the Data:
library(pool)
library(dplyr)
library(RMariaDB)
my_db <- dbPool(
  MariaDB(),
  dbname = "aaa",
  host = "localhost",
  username = "root"
)
- 1st Problem: Unable to use the slice() function

my_db %>% tbl("bbb") %>%
  slice(2:10)
#Error: slice() is not supported on database backends
For this problem I have come up with this workaround:

my_db %>% tbl("bbb") %>%
  mutate(Rowindx = row_number()) %>%
  filter(Rowindx >= 2 && Rowindx <= 10)

This has worked for me, but I want to know if there is a better solution to this problem.
- 2nd Problem: Unable to change the class or type of a column using transmute()

I have a column of class character and I want to change it to factor. I have tried the following code, which does work on non-connected data frames.

my_db %>% tbl("bbb") %>%
  transmute(colname = factor(colname))
#Error: FUNCTION aaa.factor does not exist [1305]

For this problem I haven't come up with any solution; any idea about how to solve it will be appreciated.
head works:
tbl(con, "mtcars") %>%
filter(between(disp, 50, 200)) %>%
head(3) %>%
show_query()
# <SQL>
# SELECT TOP 3 *
# FROM "mtcars"
# WHERE ("disp" BETWEEN 50.0 AND 200.0)
(I'm demonstrating against SQL Server, which uses SELECT TOP 3 instead of the SELECT ... LIMIT 3 syntax standard in most other DBMSes.)
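For MariaDB, as in the question, the same pipeline should translate to a LIMIT query instead; a sketch against the question's pool and table (the exact rendered SQL may differ slightly by dbplyr version):

# head() should translate to LIMIT on the MariaDB backend.
my_db %>% tbl("bbb") %>%
  head(3) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `bbb`
# LIMIT 3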
Granted, this does not offer the immediate ability to do a subset such as 2:10, but if all you're hoping to do is remove the first row, then I suggest you can slice(-1) (in this example) after collect()ing the data completely into R (which should be done after other filtering, to reduce the data downloaded).
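A minimal sketch of that order of operations, assuming bbb has a numeric column some_col (hypothetical) to filter on:

# Filter on the database side first, then pull the remaining rows
# into R; slice() works once the data is a local data frame.
my_db %>% tbl("bbb") %>%
  filter(some_col > 0) %>%  # hypothetical filter, runs as SQL
  collect() %>%             # download the filtered rows into R
  slice(-1)                 # drop the first row locally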
Note that this is a universal slice/head, not a per-group slice, in case you were thinking of doing that. In R/dplyr, per-group would typically be %>% do(head(10)) (which returns the first 10 rows for each group) vice %>% head(10) (which returns the first 10 rows of all data, regardless of group). I have not found do(head(3)) to work in SQL (though my testing has been limited).
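One per-group approach that does translate to SQL is a window function; a sketch, assuming a grouping column grp and an ordering column some_col (both hypothetical), and a backend that supports ROW_NUMBER() (MariaDB 10.2+):

# dbplyr translates row_number() to ROW_NUMBER() OVER
# (PARTITION BY `grp` ORDER BY `some_col`) in a subquery.
my_db %>% tbl("bbb") %>%
  group_by(grp) %>%
  dbplyr::window_order(some_col) %>%  # explicit order for the window
  filter(row_number() <= 3) %>%       # first 3 rows per group
  ungroup()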
As a side comment, you haven't mentioned ordering your data, and there is generally no guarantee about the order of rows returned by a query; if you have a specific order requirement, it should be explicit. If you are not doing that, then the notion that you always get exactly the rows you want may not be guaranteed. Granted, I'm not suggesting it is stochastic, but the order of data as stored within the DBMS is determined by many things, including the indices built on the table. If you are not aware of all of them and the ins-and-outs of the DBMS, then it's probably best to just ensure ordering before you consider slicing.
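In dplyr terms that just means an arrange() before the head(); a sketch, again assuming a sortable column some_col (hypothetical):

# arrange() translates to ORDER BY, making the first-3 rows
# deterministic rather than storage-order dependent.
my_db %>% tbl("bbb") %>%
  arrange(some_col) %>%
  head(3) %>%
  show_query()
# The generated query should now include ORDER BY `some_col`.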
In general (dplyr, not dbplyr), filter(Rowindx >= 2 && Rowindx <= 10) is wrong/broken. Use filter(Rowindx >= 2 & Rowindx <= 10) (single &, not double) or filter(between(Rowindx, 2, 10)). While it does still work in dbplyr (both & and && here map to SQL's AND), it is a little lazy/sloppy to use && in R when a vectorized operation is intended. (dplyr::between also maps to SQL's BETWEEN, so it might be better here, anyway.)
The notion of a factor does not have a universal equivalent in SQL, at least not in the way I believe you want to use it. It's an R thing. So I suggest that you mutate and filter what you need, and when you are confident you have just the rows/columns needed, collect() your data (retrieving all rows into R), and then transmute into a factor.
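Put together, a sketch against the question's table (colname is from the question; the filter is hypothetical):

# Do the SQL-translatable work first, then convert locally.
local_df <- my_db %>% tbl("bbb") %>%
  filter(!is.na(colname)) %>%           # hypothetical filter, runs in SQL
  select(colname) %>%                   # trim columns before downloading
  collect() %>%                         # bring the result into R
  transmute(colname = factor(colname))  # factor() now works locally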