I'm working in R
with MySQL
tables. I'm able to do what I need using an SQL query, but I want to know if I can make use of dbplyr
in order to have "consistent" code (I used it later to make more filters)
The "tables" i am talking about are MySQL tables that I work with RMariaDB
.
I have two tables:
> genes <- data.frame(geneid = c(1:3), thing1 = c("a","b","c"), thing2 = c("d", "f", "g"))
> genes
geneid thing1 thing2
1 1 a d
2 2 b f
3 3 c g
> diseases <- data.frame(diseaseid = c(4:6), thing3 = c("a","b","c"), thing4 = c("d", "f", "g"))
> diseases
diseaseid thing3 thing4
1 4 a d
2 5 b f
3 6 c g
Now, I have a third one where two of its columns are the keys from this tables:
> gd <- data.frame(gd = c(7:9), geneid = c(1:3), diseaseid = c(4:6), thing5 = c("dd2d", "f2ff", "g2gg"))
> gd
gd geneid diseaseid thing5
1 7 1 4 dd2d
2 8 2 5 f2ff
3 9 3 6 g2gg
I want to create an output table by joining the genes
and disease
tables on to gd
using the "geneid" and "diseaseid" columns. The output table should have three columns: "thing1", "thing4", "thing5".
> new_gd
thing1 thing4 thing5
1 a ddd dd2d
2 b fff f2ff
3 c ggg g2gg
I can do this with this SQL query (sort of):
select gd.*, g.thing1, d.thing4
from gene_disease as gd
left join genes as g
on gd.geneid = g.geneid
left join diseases as d
on gd.diseaseid = d.diseaseid
But still I have to filter out the other columns from the original two tables I don't want to see. Is there a way to make this all in dbplyr
, even if I have two make 2 "lines" (I know that dbplyr
can't make joins for more than two tables).
I am not sure where you learned that dbplyr can not join more than two tables. We can do this using dbplyr in the same way we would do it in R: by specifying multiple joins.
I would approach this problem as follows:
# connect to database
db_con = DBI::dbConnect( your_database_connection_details_here )
# connect to tables
remote_genes = dplyr::tbl(db_con, from = "genes")
remote_disease = dplyr::tbl(db_con, from = "disease")
remote_gd = dplyr::tbl(db_con, from = "gd")
# combine
remote_new_gd = remote_gd %>%
left_join(remote_genes, by = "geneid") %>%
left_join(remote_diseases, by = "diseaseid") %>%
select(thing1, thing4, thing5)
The key thing here is that we use the same database connection for all three tables. You may need to fiddle with the dplyr::tbl(db_con, ...
commands to include the database name if the tables are in different databases.
Possible confusion in terminology: In dplyr, filter
ing applies to rows. select
is the command for including only specific columns. There is no way to avoid using a select
command if you only want some output columns.