Search code examples
rdplyrdbplyr

Produce table requiring columns from three input tables


I'm working in R with MySQL tables. I'm able to do what I need using an SQL query, but I want to know if I can make use of dbplyr in order to have "consistent" code (I used it later to make more filters)

The "tables" i am talking about are MySQL tables that I work with RMariaDB.

I have two tables:

> genes <- data.frame(geneid = c(1:3), thing1 = c("a","b","c"), thing2 = c("d", "f", "g"))
> genes
  geneid thing1 thing2
1      1      a      d
2      2      b      f
3      3      c      g

> diseases <- data.frame(diseaseid = c(4:6), thing3 = c("a","b","c"), thing4 = c("d", "f", "g"))
> diseases
  diseaseid thing3 thing4
1         4      a      d
2         5      b      f
3         6      c      g

Now, I have a third one where two of its columns are the keys from this tables:

> gd <- data.frame(gd = c(7:9), geneid = c(1:3), diseaseid = c(4:6), thing5 = c("dd2d", "f2ff", "g2gg"))
> gd
  gd geneid diseaseid thing5
1  7      1         4   dd2d
2  8      2         5   f2ff
3  9      3         6   g2gg

I want to create an output table by joining the genes and disease tables on to gd using the "geneid" and "diseaseid" columns. The output table should have three columns: "thing1", "thing4", "thing5".

> new_gd
  thing1 thing4 thing5
1      a    ddd   dd2d
2      b    fff   f2ff
3      c    ggg   g2gg

I can do this with this SQL query (sort of):

select gd.*, g.thing1, d.thing4
from gene_disease as gd
left join genes as g
on gd.geneid = g.geneid
left join diseases as d
on gd.diseaseid = d.diseaseid

But still I have to filter out the other columns from the original two tables I don't want to see. Is there a way to make this all in dbplyr, even if I have two make 2 "lines" (I know that dbplyr can't make joins for more than two tables).


Solution

  • I am not sure where you learned that dbplyr can not join more than two tables. We can do this using dbplyr in the same way we would do it in R: by specifying multiple joins.

    I would approach this problem as follows:

    # connect to database
    db_con = DBI::dbConnect( your_database_connection_details_here )
    # connect to tables
    remote_genes = dplyr::tbl(db_con, from = "genes")
    remote_disease = dplyr::tbl(db_con, from = "disease")
    remote_gd = dplyr::tbl(db_con, from = "gd")
    
    # combine
    remote_new_gd = remote_gd %>%
      left_join(remote_genes, by = "geneid") %>%
      left_join(remote_diseases, by = "diseaseid") %>%
      select(thing1, thing4, thing5)
    

    The key thing here is that we use the same database connection for all three tables. You may need to fiddle with the dplyr::tbl(db_con, ... commands to include the database name if the tables are in different databases.

    Possible confusion in terminology: In dplyr, filtering applies to rows. select is the command for including only specific columns. There is no way to avoid using a select command if you only want some output columns.