I was testing an example from RStudio about "render SQL code" from dbplyr:
library(nycflights13)
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
dbplyr::sql_render(ranked)
But when run, it returns the following error message:
Error in UseMethod("sql_render") : no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
Can someone explain why?
When you are working on a "normal" data.frame
, then it returns a frame, in which case sql_render
is inappropriate (and will be very confused). If we work with just your initial code, then we can see that SQL has nothing to do with it:
library(dplyr)
library(nycflights)
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
ranked
# # A tibble: 336,776 x 5
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <dbl>
# 1 2013 1 1 2 313
# 2 2013 1 1 4 276
# 3 2013 1 1 2 313
# 4 2013 1 1 -1 440
# 5 2013 1 1 -6 742
# 6 2013 1 1 -4 633
# 7 2013 1 1 -5 691
# 8 2013 1 1 -3 570
# 9 2013 1 1 -3 570
# 10 2013 1 1 -2 502.
# # ... with 336,766 more rows
But dbplyr
won't be able to do something with that:
library(dbplyr)
sql_render(ranked)
# Error in UseMethod("sql_render") :
# no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
If, however, we have that same flights
data in a database, then we can do what you are expecting, with some minor changes.
# pgcon <- DBI::dbConnect(odbc::odbc(), ...) # to my local postgres instance
copy_to(pgcon, flights, name = "flights_table") # go get some coffee
flights_db <- tbl(pgcon, "flights_table")
ranked_db <- flights_db %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
# Adding missing grouping variables: `year`, `month`, `day`
We can see some initial data, showing the top 10 rows of what the query will eventually return:
ranked_db
# # Source: lazy query [?? x 5]
# # Database: postgres [postgres@localhost:/]
# # Groups: year, month, day
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with more rows
and we can see what the real SQL query looks like:
sql_render(ranked_db)
# <SQL> SELECT "year", "month", "day", "dep_delay", RANK() OVER (PARTITION BY "year", "month", "day" ORDER BY "dep_delay" DESC) AS "rank"
# FROM "flights_table"
Realizing that, due to the way dbplyr
operates, we don't know how many rows will be returned until we collect
it:
nrow(ranked_db)
# [1] NA
res <- collect(ranked_db)
nrow(res)
# [1] 336776
res
# # A tibble: 336,776 x 5 # <--- no longer 'Source: lazy query [?? x 5]'
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with 336,766 more rows