Search code examples
rpostgresqldplyrsmoothingdbplyr

Algorithm in R to smooth out a vector while retaining rank order


I need to write a function that can smooth a vector without losing the original rank order of the vector's values. What I've come up with is the following:

#1 Sort all values of vector in ascending order
#2 for the kth value in vector s_k in the ordered list, collect the list of 2N+1 values in the window of values between [s_{k-N}, s_{k+N}]
#3 by definition, s_k is the median of the values in that window
#4 replace s_k with the mean of value in that same window for all values of k

Ideally, I'd like to be able to write a function that relies on dbplyr as I'm working with remote data but this is not absolutely necessary as I can break the data out in chunks so base R is OK too. Similarly, this could also be all postgressql code or paritial sql partial dbplyr and it would be just the same but there are a few requirements. I need to be able to parameterize N and i need to be able to feed the function a list of dataframes or set of tables(if in a database) to loop through (in R this is simple, a function with a single parameter for N inside a lapply wrapper).

Here's what I got thus far for N=3:

#Example Data
s <- rnorm(1000, mean=50, sd=10)
test.in <- as.data.frame(s)
test.in$id <- 1:length(s)

#Non parameterized attempt 
test.out <- test.in %>%
  rename(s = union_v_corporate_candidate) %>%
  mutate(lag_k_3 = lag(s, 3), 
         lead_k_3 = lead(s, 3),
         lag_k_2 = lag(s, 2), 
         lead_k_2 = lead(s, 2),
         lag_k_1 = lag(s, 1), 
         lead_k_1 = lead(s, 1)) %>%
  mutate(window_mean = (lag_k_3 + lead_k_3 + lag_k_2 + lead_k_2 + lag_k_1 + lead_k_1 + s)/7) %>%
  select(id, s, window_mean)

The problem with the logic of above approach is that I wont be able parameterize N as each additional N value would require two additional mutate clauses.


Solution

  • What you are looking for is called Window Frames in SQL. I am referencing from this and this link. In SQL such a command might look like:

    SELECT Col1
        ,Col2
        ,SUM(Col2) OVER(ORDER BY Col1 ROWS BETWEEN N PRECEDING AND N FOLLOWING) AS window_sum
    FROM db.table
    

    Where N is the parameter of how many rows forward and back from the present row to look. So the command above produces the 2N+1 row moving sum.

    In dbplyr this functionality is provided by window_order and window_frame. Official reference here and an alternative here.

    Based on their examples, you probably want something like the following:

    N = 3
    
    test_out = test_in %>%
      group_by(id) %>% # find the moving mean for each group separately
      window_order(s) %>% # how should your data be sorted (think 'arrange') often date
      window_frame(-N, N) %>% # set width of window
      mutate(window_mean = mean(s))
    
    # check SQL produced
    sql_build(test_out)
    # or
    show_query(test_out)
    

    I strongly recommend you check the resulting SQL to ensure your R code is doing what you think it is.