I have the following table in R:
library(dplyr)
df = structure(list(name = c("John", "John", "John", "Mary", "Mary",
"Mary", "Alice", "Alice", "Alice", "Bob", "Bob", "Bob"), year = c(2001,
2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003
), var1 = c("a", "a", "a", "b", "a", "b", "a", "b", "a", "b",
"b", "b"), var2 = c("b", "a", "b", "a", "b", "a", "b", "a", "b",
"a", "b", "a")), class = "data.frame", row.names = c(NA, -12L
))
name year var1 var2
John 2001 a b
John 2002 a a
John 2003 a b
Mary 2001 b a
Mary 2002 a b
Mary 2003 b a
Alice 2001 a b
Alice 2002 b a
Alice 2003 a b
Bob 2001 b a
Bob 2002 b b
Bob 2003 b a
I want to answer the following question:
I spent some time and I think was able to do this in R:
df %>%
group_by(name) %>%
mutate(
row_num = row_number(),
var1_lag = lag(var1),
var2_lag = lag(var2),
var1_change = var1 != var1_lag,
first_year = first(year)
) %>%
filter(var1_change | row_num == max(row_num)) %>%
mutate(
category = ifelse(var1_change, "Var1 Changed", "Var1 Never Changed"),
year_when_var1_changed = ifelse(var1_change, year, NA),
var1_before = ifelse(var1_change, var1_lag, var1),
var1_after = var1,
var2_before = ifelse(var1_change, var2_lag, var2),
var2_after = var2
) %>%
filter(row_number() == 1) %>%
select(name, first_year, category, year_when_var1_changed, var1_before, var1_after, var2_before, var2_after)
# A tibble: 4 x 8
# Groups: name [4]
name first_year category year_when_var1_changed var1_before var1_after var2_before var2_after
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 John 2001 Var1 Never Changed NA a a b b
2 Mary 2001 Var1 Changed 2002 b a a b
3 Alice 2001 Var1 Changed 2002 a b b a
4 Bob 2001 Var1 Never Changed NA b b a a
Now, I am trying to do this in SQL (e.g DB2), but I keep getting stuck on how to convert this R code into SQL.
I know how to start - I definitely need to use Lag/Lead functions in a window style, but I can't figure out how to continue:
WITH CTE AS (
SELECT
name,
year,
var1,
var2,
LAG(var1, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var1_before,
LEAD(var1, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var1_after,
LAG(var2, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var2_before,
LEAD(var2, 1) OVER (PARTITION BY name ORDER BY year ASC) AS var2_after,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY year ASC) AS row_num
FROM
df
)
SELECT
*
FROM
CTE;
Can someone please show me how to do this?
Thanks!
Update using @The Impaler's answer (variable change name):
select
name as name,
min(year) over(partition by name) as first_year,
case when max(v) over(partition by name) > 1 then 'Changed' else 'Not_Changed' end as category,
year as year_when_var1_changed,
lag1 as var1_before,
var1 as var1_after,
lag2 as var2_before,
var2 as var2_after
from (
select y.*, max(v) over(partition by name) as maxv,
max(rn) over(partition by name) as maxrn
from (
select x.*,
sum(case when lag1 = var1 then 0 else 1 end)
over(partition by name order by year) as v,
row_number() over(partition by name order by year) as rn
from (
select t.*,
lag(var1) over(partition by name order by year) as lag1,
lag(var2) over(partition by name order by year) as lag2
from t
) x
) y
) z
where maxv > 1 and v = 2 or maxv = 1 and rn = maxrn;
You can do:
select *
from (
select y.*, max(v) over(partition by name) as maxv,
max(rn) over(partition by name) as maxrn
from (
select x.*,
sum(case when lag1 = var1 then 0 else 1 end)
over(partition by name order by year) as v,
row_number() over(partition by name order by year) as rn
from (
select t.*,
lag(var1) over(partition by name order by year) as lag1,
lag(var2) over(partition by name order by year) as lag2
from t
) x
) y
) z
where maxv > 1 and v = 2 or maxv = 1 and rn = maxrn
Result:
NAME YEAR VAR1 VAR2 LAG1 LAG2 V RN MAXV MAXRN
------ ----- ----- ----- ----- ----- -- --- ----- -----
Alice 2002 b a a b 2 2 3 3
Bob 2003 b a b b 1 3 1 3
John 2003 a b a a 1 3 1 3
Mary 2002 a b b a 2 2 3 3
See running example at db<>fiddle.
The example runs in DB2, but the SQL query is generic, so it should run in most modern databases.