I have the following dataset ("my_data"):
> my_data = data.frame(id = c(1,1,1,1,2,2,2,3,4,4,5,5,5,5,5), var_1 = sample(c(0,1), 15, replace = TRUE) , var_2 =sample(c(0,1), 15 , replace = TRUE) )
> head(my_data)
id var_1 var_2
1 1 0 0
2 1 0 0
3 1 0 1
4 1 1 0
5 2 1 1
6 2 1 1
Using the dplyr library (R programming language), I was able to create the following two variables:
v_1: I want v_1 to be the average of the current, previous, and previous-to-previous values of var_1 (i.e. indices n, n-1, and n-2). When fewer previous values exist (e.g. at index 1 and index 2), I want the average taken over as many values as are available.
v_2: I want v_2 to be the same thing, but computed on var_2.
Here is the code that does this:
# source: https://stackoverflow.com/questions/72512399/r-recursive-averages
library(dplyr)
final = my_data %>%
mutate(
row = seq_along(id),
v_1 = (var_1 + lag(var_1, default = 0) + lag(var_1, 2, default = 0))/pmin(row, 3),
v_2 = (var_2 + lag(var_2, default = 0) + lag(var_2, 2, default = 0))/pmin(row, 3)
)
> head(final)
id var_1 var_2 row v_1 v_2
1 1 0 0 1 0.0000000 0.0000000
2 1 0 0 2 0.0000000 0.0000000
3 1 0 1 3 0.0000000 0.3333333
4 1 1 0 4 0.3333333 0.3333333
5 2 1 1 5 0.6666667 0.6666667
6 2 1 1 6 1.0000000 0.6666667
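To make the rule concrete: the pmin(row, 3) divisor is what shrinks the window near the start. The same trailing-average logic as a standalone sketch (Python purely for illustration; trailing_mean3 is a made-up name, not part of any library):

```python
# Trailing mean of up to 3 values: average the current value with up to two
# preceding ones, dividing by fewer terms near the start of the sequence.
def trailing_mean3(x):
    return [sum(x[max(0, i - 2):i + 1]) / min(i + 1, 3) for i in range(len(x))]

# The first entry averages 1 value, the second averages 2, the rest average 3.
print(trailing_mean3([0, 0, 1, 1]))
```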
I would like to learn how to convert this R code into SQL (e.g. something that can run with DBI). I found a website that shows how to do something similar (https://dwgeek.com/netezza-cumulative-sum-average-example.html/, bounded cumulative sum in Netezza), and I tried to adapt that code to my problem:
# my 1st attempt
SELECT id,
       var_1, var_2,
       AVG(var_1) OVER (PARTITION BY id
                        ORDER BY var_1 ASC
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_1,
       AVG(var_2) OVER (PARTITION BY id
                        ORDER BY var_2 ASC
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_2
FROM my_data LIMIT 100
# my 2nd attempt
SELECT id,
       var_1, var_2,
       AVG(var_1) OVER (PARTITION BY id
                        ORDER BY var_1 ASC
                        ROWS UNBOUNDED PRECEDING) AS v_1,
       AVG(var_2) OVER (PARTITION BY id
                        ORDER BY var_2 ASC
                        ROWS UNBOUNDED PRECEDING) AS v_2
FROM my_data LIMIT 100
But I am not sure whether I have done this correctly. Can someone please show me how to fix it?
Thanks!
Why are you using the ORDER BY clause? Also note that your dplyr code has no group_by(id), so you do not need the PARTITION BY clause either. Simply do:
SELECT *,
       AVG(var_1) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_1,
       AVG(var_2) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_2
FROM my_data
Results
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "my_data", my_data)
DBI::dbGetQuery(con,
"
SELECT *,
AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1,
AVG(var_2) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_2
FROM my_data
")
id var_1 var_2 v_1 v_2
1 1 1 1 1.0000000 1.0000000
2 1 0 0 0.5000000 0.5000000
3 1 0 1 0.3333333 0.6666667
4 1 0 1 0.0000000 0.6666667
5 2 1 0 0.3333333 0.6666667
6 2 0 1 0.3333333 0.6666667
7 2 0 0 0.3333333 0.3333333
8 3 1 0 0.3333333 0.3333333
9 4 1 1 0.6666667 0.3333333
10 4 1 1 1.0000000 0.6666667
11 5 0 1 0.6666667 1.0000000
12 5 0 1 0.3333333 1.0000000
13 5 0 0 0.0000000 0.6666667
14 5 0 0 0.0000000 0.3333333
15 5 0 0 0.0000000 0.0000000
Compare to
my_data %>%
mutate(
row = seq_along(id),
v_1 = (var_1 + lag(var_1, default = 0) + lag(var_1, 2, default = 0))/pmin(row, 3),
v_2 = (var_2 + lag(var_2, default = 0) + lag(var_2, 2, default = 0))/pmin(row, 3)
)
id var_1 var_2 row v_1 v_2
1 1 1 1 1 1.0000000 1.0000000
2 1 0 0 2 0.5000000 0.5000000
3 1 0 1 3 0.3333333 0.6666667
4 1 0 1 4 0.0000000 0.6666667
5 2 1 0 5 0.3333333 0.6666667
6 2 0 1 6 0.3333333 0.6666667
7 2 0 0 7 0.3333333 0.3333333
8 3 1 0 8 0.3333333 0.3333333
9 4 1 1 9 0.6666667 0.3333333
10 4 1 1 10 1.0000000 0.6666667
11 5 0 1 11 0.6666667 1.0000000
12 5 0 1 12 0.3333333 1.0000000
13 5 0 0 13 0.0000000 0.6666667
14 5 0 0 14 0.0000000 0.3333333
15 5 0 0 15 0.0000000 0.0000000
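One caveat to add (my own caution, not something the query above guarantees): SQL tables have no inherent row order, so a window frame without ORDER BY relies on the engine happening to scan rows in insertion order. A safer variant carries an explicit row index and orders the window by it. A quick sanity check with Python's built-in sqlite3 module (window functions require SQLite >= 3.25), using the first four rows of the data shown in the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# row_num is an explicit index column, added so the frame has a defined order.
con.execute("CREATE TABLE my_data (row_num INTEGER, var_1 REAL, var_2 REAL)")
con.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 0, 0), (3, 0, 1), (4, 1, 0)],
)
rows = con.execute("""
    SELECT row_num,
           AVG(var_1) OVER (ORDER BY row_num
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_1,
           AVG(var_2) OVER (ORDER BY row_num
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS v_2
    FROM my_data
    ORDER BY row_num
""").fetchall()
print(rows)  # matches v_1/v_2 in rows 1-4 of the question's head(final) output
```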