Search code examples
rdplyrdbplyr

How do you remove ordering on remote (postgresql) table?


Calling the dplyr::arrange() on a table in a remote source adds an 'Ordered by: ...' flag. Is there a subsequent function that removes this 'Ordered by:' flag on the remote table?

Consider example data:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

For which:

glimpse(tmp_cars_sdf)
# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...

Consider:

tmp_cars <-
    cars
tmp_cars <-
    tmp_cars %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

However:

tmp_cars <-
    tmp_cars_sdf %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3 
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

Solution

  • Inspired by Simon's answer and comments on the OP, the following function is a work-around that removes all ordering (but preserves any new column computed as a result of the ordering). It may not be the most efficient or low-level/direct way of doing this, which I will come back to at the end of this answer, but I will let the dbplyr team resolve my issue if they see fit to do so.

    unarrange <-
        function(remote_df) {
    
         existing_groups <- groups(remote_df)
    
    
            remote_df <-
                remote_df %>%
                compute()
    
             remote_df <-
                tbl(remote_df$src$con, 
                    sql_render(remote_df))
    
    
             remote_df <-
                 group_by(remote_df, !!!existing_groups)
    
    
    
             return(remote_df)
    
        }
    

    Why it works

    With the input data:

    tmp_cars_sdf <-
        copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
    

    Consider

    str(tmp_cars_sdf)
    # ..$ con <truncated>
    # ..$ disco <truncated>
    # $ ops:List of 2
    # ..$ x   : 'ident' chr "tmp_cars_sdf"
    # ..$ vars: chr [1:2] "speed" "dist"
    # ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
    # - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
    

    vs

    tmp_cars_sdf <-
        tmp_cars_sdf %>%
        arrange(speed, dist)
    
    str(tmp_cars_sdf)
    # $ ops:List of 4
    # ..$ name: chr "arrange"
    # ..$ x   :List of 2
    # .. ..$ x   : 'ident' chr "tmp_cars_sdf"
    # .. ..$ vars: chr [1:2] "speed" "dist"
    # .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
    # ..$ dots:List of 2
    # .. ..$ : language ~speed
    # .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
    #     .. ..$ : language ~dist
    # .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
    #     ..$ args:List of 1
    # .. ..$ .by_group: logi FALSE
    # ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
    # - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
    

    It is clear that adding ordering via arrange actually modifies the structure of the R object, due to the fact that remote tables have no way of being intrinsically ordered (or grouped), order and grouping info must be stored locally and is only transmitted when the final query is built.

    Thus, the work-around uses three tricks: first, produce a temporary table using compute(). Note that doing this does not reset groups and ordering locally. Second, use Simon's trick to pull the simple select query corresponding to this new table, and overwrite the existing table structure so that all grouping and ordering info is lost. In order to preserve groups, the function re-adds the original groups to this table.

    Why is this useful?

    While the example provided in the OP serves to show the problem, the reason why it arose is because of mutates that depend on some (grouped) ordering on the table. Once the new columns are built, the old ordering is no longer necessary and is in fact sometimes a hindrance due to the linked issue on github. Such an example is given below:

    tmp_cars_sdf <-
        copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
    
    
    cars_df <-
        cars %>%
        arrange(speed, dist) %>%
        group_by(speed) %>%
        mutate(diff_dist_up = dist - lag(dist)) %>%
        arrange(speed, desc(dist)) %>%
        mutate(diff_dist_down = dist - lag(dist)) %>%
        ungroup() %>%
        arrange(speed, dist) %>%
        data.frame()
    

    So that:

    head(cars_df)
    # speed dist diff_dist_up diff_dist_down
    # 1     4    2           NA             -8
    # 2     4   10            8             NA
    # 3     7    4           NA            -18
    # 4     7   22           18             NA
    # 5     8   16           NA             NA
    # 6     9   10           NA             NA
    

    With the new function we can replicate this remotely:

    cars_df_2 <-
        tmp_cars_sdf %>%
        arrange(speed, dist) %>%
        group_by(speed) %>%
        mutate(diff_dist_up = dist - lag(dist)) %>%
        # unfortunately the next line is needed
        # because of https://github.com/tidyverse/dbplyr/issues/345
        unarrange() %>%
        arrange(speed, desc(dist)) %>%
        mutate(diff_dist_down = dist - lag(dist)) %>%
        ungroup() %>%
        unarrange() %>%
        collect() %>%
        arrange(speed, dist) %>%
        data.frame()
    

    And checking, we see:

    identical(cars_df, cars_df_2)
    # [1] TRUE
    

    Possible issues with this fix

    The first issue is the necessity to call compute() which uses resources. The second issue is that it must be possible to modify the structure of the R object that encodes the remote table, but I don't know how queries are built from this structure so am unable to do it.