I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct()
but it's not supported in SQL syntax. Anyways, this does indeed work:
dbGetQuery(database_name,
"SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:
distinct.df <-
left_join(sql_table_1, sql_table_2, by = "col5") %>%
sql("SELECT t.*
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM
FROM table_name t
) t
WHERE SEQNUM = 1;")
So I dplyr::left_join()
two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql()
function)? And if so what would I use for the table_name
on the line FROM table_name t
?
In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun .
or sometimes the .data
pronoun from rlang if I were in memory working in R without databases.
I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.
It looks like you are wanting to combine custom SQL code with auto-generated SQL code from dbplyr
. For this it is important to distinguish between:
DBI::db*
commands - that execute the provided SQL on the database and return the result.dbplyr
translation - where you work with a remote connection to a tableYou can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT
is a command that is accepted in your specific SQL environment.
If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers
GitHub repository (here). This includes:
union_all
function that takes in two tables accessed via dbplyr
and outputs a single table using some custom SQL code.write_to_datebase
function that takes a table accessed via dbplyr
and converts it to code that can be executed via DBI::dbExecute
dbplyr
automatically pipes your code into the next query for you when you are working with standard dplyr
verbs for which there are SQL translations defined. So long as sql translations are defined you can chain together many pipes (I used 10 or more at once) with the (almost) only disadvantage being that the sql translated query gets difficult for a human to read.
For example, consider the following:
library(dbplyr)
library(dplyr)
tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1") %>%
distinct()
When you then call show_query(df)
R returns the following auto-generated SQL code:
SELECT DISTINCT *
FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) `dbplyr_002`
But not as nicely formatted. Note that the initial command (left join) appears as a nested query, with a distinct in the outer query. Hence df
is an R link to a remote database table defined by the above sql query.
You can pipe dbplyr
into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.
custom_distinct <- function(df){
db_connection <- df$src$con
sql_query <- build_sql(con = db_connection,
"SELECT DISTINCT * FROM (\n",
sql_render(df),
") AS nested_tbl"
)
return(tbl(db_connection, sql(sql_query)))
}
df = left_join(df1, df2, by = "col1") %>%
custom_distinct()
When you then call show_query(df)
R should return the following SQL code (I say 'should' because I can not get this working with simulated sql connections), but not as nicely formatted:
SELECT DISTINCT * FROM (
SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)
) nested_tbl
As with the previous example, df
is an R link to a remote database table defined by the above sql query.
You can take the code from an existing dbplyr
remote table and convert it to a string that can be executed using DBI::db*
.
As another way of writing a distinct query:
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())
df = left_join(df1, df2, by = "col1")
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
as.character(sql_render(df)),
") AS nested_table")
local_table = dbGetQuery(db_connection, custom_distinct2)
Which will return a local R dataframe with the equivalent sql command as per the previous examples.