Search code examples
rpostgresqldplyrdbplyr

dplyr & postgresql get group_indices


I failed to find a solution to this task:

For simple case let use mtcars as an example and let's use a PostgreSQL version of the same data:

library(RPostgreSQL)
library(DBI)
library(tidyverse)

pgdrv <- dbDriver(drvName = "PostgreSQL")
db <-dbConnect(pgdrv,
                     dbname="DBMTCARS",
                     host="127.0.0.1", port=5432,
                     user = 'postgres')
copy_to(db, mtcars)
mtcars2 <- tbl(db, mtcars)

The task is quite simple, get a unique number for each group.
So:

mtcars %>% mutate( gi = group_indices(., gear, carb))

now I need the same with the DB version

mtcars2 %>% mutate( gi = group_indices(., gear, carb))

But I got:

Error in UseMethod("escape") : no applicable method for 'escape' applied to an object of class "c('tbl_PostgreSQLConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

I know how to do it with SQL syntax:

select *, (dense_rank over(order by gear, carb)) as gi FROM mtcars ;

But I would like to do with dplyr grammar or.. how do I use dbSendQuery directly in a tbl_dbi class?

Thanks in advance


Solution

  • As @akrun comments, not all functions translate. However, this can be made to work to your advantage in some cases.

    Here is an approach using dense_rank function within tidy verse:

    approach1 = mtcars2 %>%
      mutate(combined_index = paste0(gear, carb)) %>%
      mutate(gi = dense_rank(combined_index))
    

    I am pretty sure both paste0 and dense_rank will translate via dbplyr to PostgreSQL, but can not test in my current environment.

    Alternatively, you can make use of the fact that when dbplyr can not translate a command it leaves it as is. This means you can write some PostgreSQL commands directly into your R. Here is an approach using this idea:

    approach2 = mtcars2 %>%
      mutate(
                 gi = sql("DENSE_RANK() OVER(ORDER BY gear, carb)")
            )
    

    The idea is that because dbplyr does not have a translation for DENSE_RANK it will pass this, as is, through to the SQL command. The downside of this approach is that you can not transfer your code easily to an environment that does not have PostgreSQL.

    Before executing either of these, you can test whether they will work by checking whether they produce correct SQL syntax as follows:

    approach1 %>% show_query()
    approach2 %>% show_query()