Search code examples
rdbplyr

Wrong Results/Bug with last() function when using dbplyr in r


I am trying to run sql using r dbplyr and have come across a bug/wrong results when using last() function with dbplyr code. I am not sure if I am doing it wrong or if there is actually a bug.

Objective: To get last Status of each employee from the table using dbplyr.

I have tried using below code to replicate the issue.

1. Libs

library(tidyverse)
library(lubridate)
library(dbplyr)
library(RSQLite)

2. Data

df_emp_status <- data.frame(
  Emp_id = c(121,321,451,121,451,451,321,755),
  
  TimeStamp = c('29-07-2019 08:55:55','29-07-2019 09:02:55','29-07-2019 09:05:50',
               '29-07-2019 10:05:50','29-07-2019 10:07:50','29-07-2019 10:10:10',
               '29-07-2019 10:20:10','29-07-2019 11:00:00'),
  
  Status = c('IN','IN','IN','OUT','OUT','IN','OUT','IN')
          )

3. DB connection & Data

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_emp_status_sql <- copy_to(con, df_emp_status, overwrite = TRUE)

4. Objective: To get last Status of each employee

4.1 R Code

using summarise()

df_emp_status %>% 
  group_by(Emp_id) %>% 
  arrange(Emp_id,TimeStamp) %>% 
  summarise(last_status = last(Status))
# A tibble: 4 × 2
  Emp_id last_status
   <dbl> <chr>      
1    121 OUT        
2    321 OUT        
3    451 IN         
4    755 IN

using mutate

df_emp_status %>% 
  group_by(Emp_id) %>% 
  arrange(Emp_id,TimeStamp) %>% 
  mutate(last_status = last(Status))
# A tibble: 8 × 4
# Groups:   Emp_id [4]
  Emp_id TimeStamp           Status last_status
   <dbl> <dttm>              <chr>  <chr>      
1    121 2019-07-29 08:55:55 IN     OUT        
2    121 2019-07-29 10:05:50 OUT    OUT        
3    321 2019-07-29 09:02:55 IN     OUT        
4    321 2019-07-29 10:20:10 OUT    OUT        
5    451 2019-07-29 09:05:50 IN     IN         
6    451 2019-07-29 10:07:50 OUT    IN         
7    451 2019-07-29 10:10:10 IN     IN         
8    755 2019-07-29 11:00:00 IN     IN 

Above Results are correct & giving me the expected results:

4.2 SQL / DBPLYR replication (gives wrong results)

since last() doesn't work with summarise when using dbplyr so gonna use mutate() instead

df_emp_status_sql %>% 
  group_by(Emp_id) %>% 
  arrange(Emp_id,TimeStamp) %>% 
  mutate(last_status = last(Status))
# Source:     SQL [8 x 4]
# Database:   sqlite 3.39.4 [:memory:]
# Groups:     Emp_id
# Ordered by: Emp_id, TimeStamp
  Emp_id  TimeStamp Status last_status
   <dbl>      <dbl> <chr>  <chr>      
1    121 1564390555 IN     IN         
2    121 1564394750 OUT    OUT        
3    321 1564390975 IN     IN         
4    321 1564395610 OUT    OUT        
5    451 1564391150 IN     IN         
6    451 1564394870 OUT    OUT        
7    451 1564395010 IN     IN         
8    755 1564398000 IN     IN 

Above results are wrong

4.3 Query Crosscheck

df_emp_status_sql %>% 
  group_by(Emp_id) %>% 
  arrange(Emp_id,TimeStamp) %>% 
  mutate(last_status = last(Status)) %>% 
  show_query()
<SQL>
SELECT
  *,
  LAST_VALUE(`Status`) OVER (PARTITION BY `Emp_id` ORDER BY `Emp_id`, `TimeStamp`) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`

I have tried to run above query separately and this also throws Wrong Results.

4.4 Correct Query I was able to find Correct Query that worked for desired results from How to use last_value() in sql:

dbGetQuery(con,'
  SELECT
  *,
  LAST_VALUE(`Status`) OVER (
                            PARTITION BY `Emp_id` 
                                ORDER BY `Emp_id`, `TimeStamp`
                                RANGE BETWEEN 
                                  UNBOUNDED PRECEDING AND 
                                  UNBOUNDED FOLLOWING) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`
')
  Emp_id  TimeStamp Status last_status
1    121 1564390555     IN         OUT
2    121 1564394750    OUT         OUT
3    321 1564390975     IN         OUT
4    321 1564395610    OUT         OUT
5    451 1564391150     IN          IN
6    451 1564394870    OUT          IN
7    451 1564395010     IN          IN
8    755 1564398000     IN          IN

5. Conclusion:

RANGE BETWEEN 
                                  UNBOUNDED PRECEDING AND 
                                  UNBOUNDED FOLLOWING

As above code seems to be important to get the right results in sql query with last_value so the dbplyr translation should also include above code in their r to sql translation otherwise it will give wrong results.


Solution

  • I can not recall where, but I can recall seeing something about default settings for PRECEDING and FOLLOWING. It seems likely that these are determined implicitly when working with dbplyr and hence you may not be able to set them directly.

    An alternative approach that will work for this application is lead. Try something like the following:

    df_emp_status_sql %>% 
      group_by(Emp_id) %>% 
      arrange(TimeStamp) %>% 
      mutate(next_status = lead(Status)) %>%
      filter(is.na(next_status)) %>%
      select(-next_status)
    

    The idea is to use the lead function to add the next status to each record. The very last record for each id will have a missing (NA or NULL depending on choice of language) value, so you ca filter on this value.