I need to write a function that can smooth a vector without losing the original rank order of the vector's values. What I've come up with is the following:
#1 Sort all values of vector in ascending order
#2 for the kth value in vector s_k in the ordered list, collect the list of 2N+1 values in the window of values between [s_{k-N}, s_{k+N}]
#3 by definition, s_k is the median of the values in that window
#4 replace s_k with the mean of value in that same window for all values of k
Ideally, I'd like to be able to write a function that relies on dbplyr
as I'm working with remote data but this is not absolutely necessary as I can break the data out in chunks so base R is OK too. Similarly, this could also be all postgressql code or paritial sql partial dbplyr
and it would be just the same but there are a few requirements. I need to be able to parameterize N
and i need to be able to feed the function a list of dataframes or set of tables(if in a database) to loop through (in R this is simple, a function with a single parameter for N
inside a lapply
wrapper).
Here's what I got thus far for N=3
:
#Example Data
s <- rnorm(1000, mean=50, sd=10)
test.in <- as.data.frame(s)
test.in$id <- 1:length(s)
#Non parameterized attempt
test.out <- test.in %>%
rename(s = union_v_corporate_candidate) %>%
mutate(lag_k_3 = lag(s, 3),
lead_k_3 = lead(s, 3),
lag_k_2 = lag(s, 2),
lead_k_2 = lead(s, 2),
lag_k_1 = lag(s, 1),
lead_k_1 = lead(s, 1)) %>%
mutate(window_mean = (lag_k_3 + lead_k_3 + lag_k_2 + lead_k_2 + lag_k_1 + lead_k_1 + s)/7) %>%
select(id, s, window_mean)
The problem with the logic of above approach is that I wont be able parameterize N
as each additional N
value would require two additional mutate clauses.
What you are looking for is called Window Frames in SQL. I am referencing from this and this link. In SQL such a command might look like:
SELECT Col1
,Col2
,SUM(Col2) OVER(ORDER BY Col1 ROWS BETWEEN N PRECEDING AND N FOLLOWING) AS window_sum
FROM db.table
Where N
is the parameter of how many rows forward and back from the present row to look. So the command above produces the 2N+1
row moving sum.
In dbplyr this functionality is provided by window_order
and window_frame
. Official reference here and an alternative here.
Based on their examples, you probably want something like the following:
N = 3
test_out = test_in %>%
group_by(id) %>% # find the moving mean for each group separately
window_order(s) %>% # how should your data be sorted (think 'arrange') often date
window_frame(-N, N) %>% # set width of window
mutate(window_mean = mean(s))
# check SQL produced
sql_build(test_out)
# or
show_query(test_out)
I strongly recommend you check the resulting SQL to ensure your R code is doing what you think it is.