Calling dplyr::arrange() on a table in a remote source adds an 'Ordered by: ...' flag to it. Is there a subsequent function that removes this 'Ordered by:' flag from the remote table?
Consider example data:
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
For which:
glimpse(tmp_cars_sdf)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...
Consider:
tmp_cars <-
cars
tmp_cars <-
tmp_cars %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
However:
tmp_cars <-
tmp_cars_sdf %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
Inspired by Simon's answer and the comments on the OP, the following function is a work-around that removes all ordering from a remote table (while preserving any new columns computed as a result of that ordering). It may not be the most efficient or the most direct, low-level way of doing this (I come back to that at the end of this answer), and I will leave it to the dbplyr team to resolve the issue if they see fit.
unarrange <- function(remote_df) {
  # remember the current grouping so it can be restored at the end
  existing_groups <- groups(remote_df)
  # materialise the current query (including any columns that depend on
  # the ordering) into a temporary table on the database
  remote_df <- compute(remote_df)
  # re-wrap the rendered SQL in a fresh tbl; this discards the locally
  # stored ordering (and grouping) information
  remote_df <- tbl(remote_df$src$con, sql_render(remote_df))
  # restore the original grouping
  remote_df <- group_by(remote_df, !!!existing_groups)
  remote_df
}
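As a quick check (a sketch, assuming the con_psql connection and the tmp_cars_sdf table created above), piping an arranged table through unarrange() should make the 'Ordered by:' line disappear:
tmp_cars_sdf %>%
arrange(speed, dist) %>%
unarrange() %>%
glimpse()
# the "# Ordered by: speed, dist" line should no longer appear in the output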
With the input data:
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
Consider:
str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
vs
tmp_cars_sdf <-
tmp_cars_sdf %>%
arrange(speed, dist)
str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x :List of 2
# .. ..$ x : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
It is clear that adding ordering via arrange() actually modifies the structure of the R object: because remote tables have no way of being intrinsically ordered (or grouped), the ordering and grouping information must be stored locally, and it is only transmitted when the final query is built.
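You can see where the ordering ends up by rendering the query (a sketch; the exact quoting of the generated SQL will depend on the dbplyr version):
tmp_cars_sdf %>%
arrange(speed, dist) %>%
show_query()
# the rendered SQL should end in an ORDER BY clause, roughly:
# SELECT * FROM "tmp_cars_sdf" ORDER BY "speed", "dist"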
Thus, the work-around uses three tricks: first, produce a temporary table using compute() (note that doing this does not reset the groups and ordering stored locally); second, use Simon's trick of pulling the plain select query corresponding to this new table and wrapping it in a fresh tbl, so that all of the locally stored grouping and ordering information is lost; third, re-add the original groups to this table so that grouping is preserved.
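In isolation, and using only the calls already present in unarrange(), the three steps look roughly like this (a sketch, assuming the same connection and a grouped, arranged table):
grouped_sdf <-
tmp_cars_sdf %>%
group_by(speed) %>%
arrange(dist)
existing_groups <- groups(grouped_sdf)
computed <- compute(grouped_sdf)                          # trick 1: temporary table on the database
rewrapped <- tbl(computed$src$con, sql_render(computed))  # trick 2: fresh tbl; ordering/grouping info lost
group_by(rewrapped, !!!existing_groups)                   # trick 3: restore the original groups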
While the example provided in the OP serves to show the problem, the reason it arose in the first place is mutate() calls that depend on some (grouped) ordering of the table. Once the new columns are built, the old ordering is no longer necessary and is in fact sometimes a hindrance, due to the linked issue on GitHub. Such an example is given below:
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
cars_df <-
cars %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
arrange(speed, dist) %>%
data.frame()
So that:
head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1 4 2 NA -8
# 2 4 10 8 NA
# 3 7 4 NA -18
# 4 7 22 18 NA
# 5 8 16 NA NA
# 6 9 10 NA NA
With the new function we can replicate this remotely:
cars_df_2 <-
tmp_cars_sdf %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
# unfortunately the next line is needed
# because of https://github.com/tidyverse/dbplyr/issues/345
unarrange() %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
unarrange() %>%
collect() %>%
arrange(speed, dist) %>%
data.frame()
And checking, we see:
identical(cars_df, cars_df_2)
# [1] TRUE
There are two remaining issues. The first is the need to call compute(), which uses resources on the database. The second is that it should be possible to achieve the same thing by modifying the structure of the R object that encodes the remote table directly, but I do not know how queries are built from this structure, so I am unable to do it.
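For what it's worth, a more direct approach would presumably walk the $ops structure shown in the str() output above and drop the 'op_arrange' nodes. The following is an untested sketch only (the names strip_arrange_ops and unarrange_direct are hypothetical, and the code relies on dbplyr internals that may change between versions); unlike unarrange(), it does not first materialise anything with compute(), so dropping the arrange node would also drop the ordering that any pending window functions (such as lag()) rely on.
# hypothetical sketch, relying on the internal $ops structure shown above
strip_arrange_ops <- function(op) {
  if (inherits(op, "op_arrange")) {
    # drop this node and keep walking down its input
    return(strip_arrange_ops(op$x))
  }
  if (inherits(op, "op_single")) {
    # any other single-table operation: recurse into its input
    op$x <- strip_arrange_ops(op$x)
  }
  op
}

unarrange_direct <- function(remote_df) {
  remote_df$ops <- strip_arrange_ops(remote_df$ops)
  remote_df
}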