Search code examples
sqlrrodbc

Sort by one variable, group by another, and select first row in SQL Query in R


I need to apply a procedure in SQL that is easy for me since R, but has been really tortuous in SQL.

I need to sort the data from highest to lowest by two variables, group based on another variable, and select the first item in each group.

I leave the code that I am trying to pass from R to SQL. Unfortunately the dbplyr package throws me an error when trying to convert one language to another: Error: first() is only available in a windowed (mutate()) context

library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

mtcars2 <- tbl(con, "mtcars")
mtcars2

mtcars2 %>% 
  arrange(-mpg,-disp) %>% 
  group_by(cyl) %>% 
  summarise(hp = first(hp)) %>% 
  show_query()

It seems to me that the DISTINCT ON function could help me.

Thanks for your help.


Solution

  • Maybe the following?

    library(tidyverse)
    library(dbplyr)
    
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    copy_to(con, mtcars)
    
    mtcars2 <- tbl(con, "mtcars")
    
    mtcars2 %>%
      arrange(-mpg,-disp) %>%
      group_by(cyl) %>%
      mutate(hp = first(hp)) %>% 
      select(cyl, hp) %>% 
      distinct %>% 
      show_query
    
    #> <SQL>
    #> SELECT DISTINCT `cyl`, FIRST_VALUE(`hp`) OVER (PARTITION BY `cyl` ORDER BY -`mpg`, -`disp`) AS `hp`
    #> FROM `mtcars`
    #> ORDER BY -`mpg`, -`disp`
    

    See: https://github.com/tidyverse/dbplyr/issues/129