Search code examples
sqlraveragecumulative-sumrolling-computation

Converting R code to SQL : Cumulative Averages and Rolling Averages


I have the following dataset ("my_data"):

> my_data = data.frame(id = c(1,1,1,1,2,2,2,3,4,4,5,5,5,5,5), var_1 = sample(c(0,1), 15, replace = TRUE) , var_2 =sample(c(0,1), 15 , replace = TRUE) )

> head(my_data)
  id var_1 var_2
1  1     0     0
2  1     0     0
3  1     0     1
4  1     1     0
5  2     1     1
6  2     1     1

Using the DPLYR library (R programming language), I was able to make the following modifications:

  • v_1: I want v_1 to be the average value of the current, previous and previous-to-previous values of var_1 (i.e. index = n, index = n-1 and index = n-2). When this is not possible (e.g. for index = 2 and index = 1), I want this average to be for as back as you can go.

  • v_2: I want v_2 to be the average value of the current, previous and previous-to-previous values of var_2 (i.e. index = n, index = n-1 and index = n-2). When this is not possible (e.g. for index = 2 and index = 1), I want this average to be for as back as you can go.

This would be something like this:

  • row 1 (id = 1, index = 1) : v_1 = var_1 (index 1)
  • row 2 (id = 1, index = 1 ; id = 1 index = 2) : v_1 = (var_1 (index 1) + var_1 (index 2))/2
  • row 3 (id = 1, index = 1 ; id = 1 index = 2; id = 1, index = 3) : v_1 = (var_1 (index 1) + - var_1 (index 2) + var_1 (index 3)) /3
  • row 4 (id = 1, index = 2 ; id = 1 index = 3; id = 1, index = 4) : v_1 = (var_1 (index 2) + - var_1 (index 3) + var_1 (index 4)) /3
  • etc.

Here is the code:

# source: https://stackoverflow.com/questions/72512399/r-recursive-averages
library(dplyr)

final = my_data %>% 
    mutate(
        row = seq_along(id),
        v_1 = (var_1 + lag(var_1, default = 0) + lag(var_1, 2, default = 0))/pmin(row, 3),
        v_2 = (var_2 + lag(var_2, default = 0) + lag(var_2, 2, default = 0))/pmin(row, 3)
    ) 

> head(final)
  id var_1 var_2 row       v_1       v_2
1  1     0     0   1 0.0000000 0.0000000
2  1     0     0   2 0.0000000 0.0000000
3  1     0     1   3 0.0000000 0.3333333
4  1     1     0   4 0.3333333 0.3333333
5  2     1     1   5 0.6666667 0.6666667
6  2     1     1   6 1.0000000 0.6666667

I would like to learn how to convert this R code into SQL code (e.g. something that can run with DBI). I found this website over here that shows how to do something similar (https://dwgeek.com/netezza-cumulative-sum-average-example.html/, bounded Cumulative sum in Netezza) and I tried to adapt the code for my problem:

# my 1st attempt
SELECT id,
       var_1,var_2,
       AVG(var_1) OVER(PARTITION BY id 
             ORDER BY var_1 ASC 
             rows between 2 preceding and current row) v_1, AVG(var_2) OVER(PARTITION BY id 
             ORDER BY var_2 ASC 
             rows between 2 preceding and current row) v_2
FROM   my_data limit 100


 # my 2nd attempt
SELECT id,
       var_1,var_2,
       AVG(var_1) OVER(PARTITION BY id 
             ORDER BY var_1 ASC 
             rows unbounded preceding) v_1, AVG(var_2) OVER(PARTITION BY id 
             ORDER BY var_2 ASC 
            rows unbounded preceding) v_2
FROM   my_data limit 100

But I am not sure if I have done this correctly - can someone please show me how to correct this?

Thanks!

  • I am using Netezza

Solution

  • Why are you using the order by clause? Also note that in your dplyr code, you do not have the group_by(id) hence you do not need the partition by clause. simply do:

    SELECT *,
      AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1,
      AVG(var_2) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_2 
    FROM my_data
    

    Results

    SQL

    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "my_data", my_data)
    DBI::dbGetQuery(con, 
    "
    SELECT *,
      AVG(var_1) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_1,
      AVG(var_2) over(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) v_2 
    FROM my_data
    ")
      id var_1 var_2       v_1       v_2
    1   1     1     1 1.0000000 1.0000000
    2   1     0     0 0.5000000 0.5000000
    3   1     0     1 0.3333333 0.6666667
    4   1     0     1 0.0000000 0.6666667
    5   2     1     0 0.3333333 0.6666667
    6   2     0     1 0.3333333 0.6666667
    7   2     0     0 0.3333333 0.3333333
    8   3     1     0 0.3333333 0.3333333
    9   4     1     1 0.6666667 0.3333333
    10  4     1     1 1.0000000 0.6666667
    11  5     0     1 0.6666667 1.0000000
    12  5     0     1 0.3333333 1.0000000
    13  5     0     0 0.0000000 0.6666667
    14  5     0     0 0.0000000 0.3333333
    15  5     0     0 0.0000000 0.0000000
    

    Compare to

    R

    my_data %>% 
         mutate(
             row = seq_along(id),
             v_1 = (var_1 + lag(var_1, default = 0) + lag(var_1, 2, default = 0))/pmin(row, 3),
             v_2 = (var_2 + lag(var_2, default = 0) + lag(var_2, 2, default = 0))/pmin(row, 3)
         ) 
       id var_1 var_2 row       v_1       v_2
    1   1     1     1   1 1.0000000 1.0000000
    2   1     0     0   2 0.5000000 0.5000000
    3   1     0     1   3 0.3333333 0.6666667
    4   1     0     1   4 0.0000000 0.6666667
    5   2     1     0   5 0.3333333 0.6666667
    6   2     0     1   6 0.3333333 0.6666667
    7   2     0     0   7 0.3333333 0.3333333
    8   3     1     0   8 0.3333333 0.3333333
    9   4     1     1   9 0.6666667 0.3333333
    10  4     1     1  10 1.0000000 0.6666667
    11  5     0     1  11 0.6666667 1.0000000
    12  5     0     1  12 0.3333333 1.0000000
    13  5     0     0  13 0.0000000 0.6666667
    14  5     0     0  14 0.0000000 0.3333333
    15  5     0     0  15 0.0000000 0.0000000