Search code examples
sqlrdb2

Differences in LAG() function: R vs SQL (Converting R code to SQL)


I have the following table in R:

library(dplyr)

   df = structure(list(name = c("John", "John", "John", "Mary", "Mary", 
"Mary", "Alice", "Alice", "Alice", "Bob", "Bob", "Bob"), year = c(2001, 
2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003
), var1 = c("a", "a", "a", "b", "a", "b", "a", "b", "a", "b", 
"b", "b"), var2 = c("b", "a", "b", "a", "b", "a", "b", "a", "b", 
"a", "b", "a")), class = "data.frame", row.names = c(NA, -12L
))


  name year var1 var2
  John 2001    a    b
  John 2002    a    a
  John 2003    a    b
  Mary 2001    b    a
  Mary 2002    a    b
  Mary 2003    b    a
 Alice 2001    a    b
 Alice 2002    b    a
 Alice 2003    a    b
   Bob 2001    b    a
   Bob 2002    b    b
   Bob 2003    b    a

I want to answer the following question:

  • For each name, when (i.e., which row_num) does var1 change for the first time? Keep the full information for that row so we can see the change in var1_before/var1_after and var2_before/var2_after
  • If a name kept its var1 value throughout - return the full information row for the last available year corresponding to that name (along with the row_number)
  • The final result will only have 1 row per name

I spent some time and I think was able to do this in R:

df %>%
    group_by(name) %>%
    mutate(
        row_num = row_number(),
        var1_lag = lag(var1),
        var2_lag = lag(var2),
        var1_change = var1 != var1_lag,
        first_year = first(year)  
    ) %>%
    filter(var1_change | row_num == max(row_num)) %>%
    mutate(
        category = ifelse(var1_change, "Var1 Changed", "Var1 Never Changed"),

        year_when_var1_changed = ifelse(var1_change, year, NA),
        var1_before = ifelse(var1_change, var1_lag, var1),
        var1_after = var1,
        var2_before = ifelse(var1_change, var2_lag, var2),
        var2_after = var2
    ) %>%
    filter(row_number() == 1) %>%
    select(name, first_year, category, year_when_var1_changed, var1_before, var1_after, var2_before, var2_after)


  # A tibble: 4 x 8
# Groups:   name [4]
  name  first_year category           year_when_var1_changed var1_before var1_after var2_before var2_after
  <chr>      <dbl> <chr>                               <dbl> <chr>       <chr>      <chr>       <chr>     
1 John        2001 Var1 Never Changed                     NA a           a          b           b         
2 Mary        2001 Var1 Changed                         2002 b           a          a           b         
3 Alice       2001 Var1 Changed                         2002 a           b          b           a         
4 Bob         2001 Var1 Never Changed                     NA b           b          a           a 

Now, I am trying to do this in SQL (e.g DB2), but I keep getting stuck on how to convert this R code into SQL.

I know how to start - I definitely need to use Lag/Lead functions in a window style, but I can't figure out how to continue:

WITH CTE AS (
    SELECT 
        name, 
        year, 
        var1, 
        var2,
        LAG(var1, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var1_before,
        LEAD(var1, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var1_after,
        LAG(var2, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var2_before,
        LEAD(var2, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var2_after,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY year ASC) AS row_num
    FROM 
        df
)
SELECT 
  *
FROM 
    CTE;

Can someone please show me how to do this?

Thanks!

Update using @The Impaler's answer (variable change name):

select 
  name as name,
  min(year) over(partition by name) as first_year,
  case when max(v) over(partition by name) > 1 then 'Changed' else 'Not_Changed' end as category,
  year as year_when_var1_changed,
  lag1 as var1_before,
  var1 as var1_after,
  lag2 as var2_before,
  var2 as var2_after
from (
  select y.*, max(v) over(partition by name) as maxv,
    max(rn) over(partition by name) as maxrn
  from (
    select x.*, 
      sum(case when lag1 = var1 then 0 else 1 end) 
        over(partition by name order by year) as v,
      row_number() over(partition by name order by year) as rn
    from (
      select t.*,
        lag(var1) over(partition by name order by year) as lag1,
        lag(var2) over(partition by name order by year) as lag2
      from t
    ) x
  ) y
) z
where maxv > 1 and v = 2 or maxv = 1 and rn = maxrn;

Solution

  • You can do:

    select *
    from (
      select y.*, max(v) over(partition by name) as maxv,
        max(rn) over(partition by name) as maxrn
      from (
        select x.*, 
          sum(case when lag1 = var1 then 0 else 1 end) 
            over(partition by name order by year) as v,
          row_number() over(partition by name order by year) as rn
        from (
          select t.*,
            lag(var1) over(partition by name order by year) as lag1,
            lag(var2) over(partition by name order by year) as lag2
          from t
        ) x
      ) y
    ) z
    where maxv > 1 and v = 2 or maxv = 1 and rn = maxrn
    

    Result:

     NAME   YEAR  VAR1  VAR2  LAG1  LAG2  V  RN  MAXV  MAXRN 
     ------ ----- ----- ----- ----- ----- -- --- ----- ----- 
     Alice  2002  b     a     a     b     2  2   3     3     
     Bob    2003  b     a     b     b     1  3   1     3     
     John   2003  a     b     a     a     1  3   1     3     
     Mary   2002  a     b     b     a     2  2   3     3     
    

    See running example at db<>fiddle.

    The example runs in DB2, but the SQL query is generic, so it should run in most modern databases.