python numpy convolution rolling-computation moving-average

Compute moving average with non-uniform domain

How to calculate rolling / moving average using python + NumPy / SciPy? discusses the situation when the observations are equally spaced, i.e., the index is equivalent to an integer range.

In my case, the observations come at arbitrary times and the interval between them can be an arbitrary float. E.g.,

import pandas as pd
import numpy as np

df = pd.DataFrame({"y":np.random.uniform(size=100)}, index=np.random.uniform(size=100)).sort_index()

I want to add a column yavg to df whose value at a give index value x0 is

sum(df.y[x]*f(x0-x) for x in df.index) / sum(f(x0-x) for x in df.index)

for a given function f, e.g.,

def f(x):
    return np.exp(-x*x)

How do I do this with a minimal effort (preferably in pure numpy)?

Solution

I think you can do something like this:

index_np_arr = df.index.values
weighted_sum = np.sum(df['y'].values[:, np.newaxis] * f(index_np_arr - index_np_arr[:, np.newaxis]), axis=0)
entire_sum = np.sum(f(index_np_arr[:, np.newaxis] - index_np_arr), axis=0)


df['yavg'] = pd.Series(weighted_sum/entire_sum, index=df.index)

Basically:

index_np_arr is the np.array of all possible x0 values;
entire_sum would, get the denominator for all values in the index by repeating the vector n times, where n is the number of indexes and then subtracting for each x0. Finally it would sum it all up;
weighted_sum would do almost the same thing except before we sum we would multiply by the y vector.

Complete code:

import pandas as pd
import numpy as np

def f(x):
    return np.exp(-x*x)

df = pd.DataFrame({"y":np.random.uniform(size=100)}, index=np.random.uniform(size=100)).sort_index()

index_np_arr = df.index.values
weighted_sum = np.sum(df['y'].values[:, np.newaxis] * f(index_np_arr - index_np_arr[:, np.newaxis]), axis=0)
entire_sum = np.sum(f(index_np_arr[:, np.newaxis] - index_np_arr), axis=0)


df['yavg'] = pd.Series(weighted_sum/entire_sum, index=df.index)

Note: This code does have a high memory usage because you will create an array of shape (n, n) for computing the sums using vectorized functions, but is probably faster than iterating over all values of x.