Tags: python, pandas, dataframe, sample, smoothing

'Oversampling' cartesian data in a dataframe without for loop?


I have 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it using a for loop like this (starting with a dataframe df with three columns X, Y, Z):

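import pandas as pd

# df is assumed to exist already, with columns 'X', 'Y' and 'Z'
X_OS, Y_OS, Z_OS = [], [], []
for index, row in df.iterrows():
    # Points strictly inside the 10x10 box centred on this row's (X, Y)
    in_box = (
        (df['X'] > row['X'] - 5) & (df['X'] < row['X'] + 5)
        & (df['Y'] > row['Y'] - 5) & (df['Y'] < row['Y'] + 5)
    )
    Z_OS.append(df.loc[in_box, 'Z'].mean())
    X_OS.append(row['X'])
    Y_OS.append(row['Y'])

# Named `data` rather than `dict` to avoid shadowing the built-in
data = {
    'X': X_OS,
    'Y': Y_OS,
    'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(data)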

but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?


Solution

    xy = df[['x', 'y']]
    df['smoothed z'] = df[['z']].apply(
        lambda row: df['z'][(xy - xy.loc[row.name]).abs().lt(5).all(axis=1)].mean(),
        axis=1
    )

    • df[['z']] selects column 'z' as a one-column data frame rather than a series, so that apply with axis=1 passes each row to the function and row.name gives that row's index label.
    • .abs().lt(5).all(axis=1) reads as "absolute values that are all less than 5 along the row", i.e. both the x and the y difference are below 5.
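    To make the mask concrete, here is a minimal sketch of what happens for a single centre row. The four-point data frame is invented purely for illustration, and the lowercase column names follow the answer rather than the question's X, Y, Z.

```python
import pandas as pd

# Hypothetical toy data, not from the original question
df = pd.DataFrame({
    'x': [0, 3, 10, 12],
    'y': [0, 4, 10, 20],
    'z': [1.0, 3.0, 5.0, 7.0],
})
xy = df[['x', 'y']]

# Take row 0 as the centre: the Series xy.loc[0] is aligned on the
# column labels and subtracted from every row of xy
diff = (xy - xy.loc[0]).abs()

# True only where both |dx| < 5 and |dy| < 5
mask = diff.lt(5).all(axis=1)

# Mean of z over the selected neighbours (the centre row included)
smoothed_0 = df.loc[mask, 'z'].mean()
```

    Here only rows 0 and 1 fall inside the box around row 0, so smoothed_0 is the mean of their two z values. The apply call simply repeats this step for every row.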

    Update

    The code below is equivalent but arguably more consistent, as it works on the index directly:

    df['smoothed z'] = df.index.to_series().apply(lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(axis=1), 'z'].mean())
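
    Both variants still call a Python function once per row. If that is too slow, one possible alternative (not part of the answer above) is to build the whole neighbour mask at once with NumPy broadcasting. The sketch below assumes lowercase column names as in the answer, assumes z contains no NaN values, and needs memory on the order of n² for n rows, so it only suits moderately sized data. The helper name smooth_box and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def smooth_box(df, half_width=5, cols=('x', 'y'), value='z'):
    """Box-average `value` over all points within half_width in every column of `cols`."""
    pts = df[list(cols)].to_numpy(dtype=float)   # shape (n, 2)
    z = df[value].to_numpy(dtype=float)          # shape (n,)

    # Pairwise absolute coordinate differences, shape (n, n, 2)
    diff = np.abs(pts[:, None, :] - pts[None, :, :])

    # inside[i, j] is True when point j lies in the box centred on point i
    inside = (diff < half_width).all(axis=2)

    # Sum of neighbouring z values divided by the neighbour count;
    # every point is its own neighbour, so the count is at least 1
    sums = inside.astype(float) @ z
    counts = inside.sum(axis=1)
    return pd.Series(sums / counts, index=df.index)

# Hypothetical toy data, invented for illustration
df = pd.DataFrame({
    'x': [0, 3, 10, 12],
    'y': [0, 4, 10, 20],
    'z': [1.0, 3.0, 5.0, 7.0],
})
df['smoothed z'] = smooth_box(df)
```

    For the question's data frame, the same helper could be called as smooth_box(df, cols=('X', 'Y'), value='Z').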