python machine-learning signal-processing kaggle

KNeighborsRegressor as denoising algorithm

On Kaggle I have found several algorithms used for signal denoising, such as Golay filters, spline functions, autoregressive modelling, and KNeighborsRegressor.

Link: https://www.kaggle.com/residentmario/denoising-algorithms

How exactly does KNeighborsRegressor work as a denoiser? I cannot find any article explaining its use for signal denoising. What kind of algorithm is it? I would like to understand how it works.

Solution

It is a supervised learning algorithm.

Normally the algorithm is first trained on known data and approximates a function that best represents that data, so that an output can be produced for a previously unseen input. For k-nearest neighbours, "training" essentially means storing the data so that neighbours can be looked up later.

Put simply, it predicts the value for a previously unseen input as the average of the k nearest points it has already seen. A more detailed explanation can be found here: https://towardsdatascience.com/the-basics-knn-for-classification-and-regression-c1e8a6c955
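
To make the "average of the k nearest points" idea concrete, here is a minimal hand-written sketch. The toy data and the helper name knn_predict are made up for illustration; this is not scikit-learn's implementation, only the same idea for a one-dimensional input with uniform weights.

```python
import numpy as np

# Toy training data, made up for illustration only.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])


def knn_predict(x_query, k):
    """Return the plain mean of the targets of the k training points nearest to x_query."""
    distances = np.abs(x_train - x_query)
    nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
    return y_train[nearest].mean()


# The 3 points nearest to 2.2 are x = 2, 3 and 1, so the prediction is (4 + 9 + 1) / 3.
prediction = knn_predict(2.2, k=3)
print(prediction)
```

Nothing is learned beyond the stored points: the prediction changes only when the query moves far enough for a different set of neighbours to be selected.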

In the Kaggle code, the time vector is:

df.index.values[:, np.newaxis]

and the signal vector is:

df.iloc[:, 0]
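
The [:, np.newaxis] part is there because scikit-learn estimators expect the input as a 2-D array of shape (n_samples, n_features), while the index on its own is 1-D. A small sketch of the shapes involved, using a hypothetical five-row DataFrame as a stand-in for the Kaggle data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle DataFrame: one signal column, integer index as time.
df = pd.DataFrame({'signal': [0.1, 0.4, 0.3, 0.8, 0.5]})

X = df.index.values[:, np.newaxis]  # time as a single feature column: (n_samples, 1)
y = df.iloc[:, 0]                   # signal values: (n_samples,)

print(df.index.values.shape)  # 1-D before adding the axis
print(X.shape)
print(y.shape)
```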

It appears the Kaggle author first fits the model on this data (KNN is not a neural network, so "model" is the more accurate word):

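import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# df is the DataFrame from the Kaggle notebook (signal in the first column)
# define the KNN model: each prediction averages the 100 nearest samples
clf = KNeighborsRegressor(n_neighbors=100, weights='uniform')
# fit the model on (time, signal) pairs
clf.fit(df.index.values[:, np.newaxis],
        df.iloc[:, 0])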

This gives a function that represents the relationship between time and the signal value. The author then passes the same time vector back to the model to reproduce the signal:

y_pred = clf.predict(df.index.values[:, np.newaxis])

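This new signal is the model's best estimate of the underlying signal. Because each predicted point is the uniform average of the 100 nearest samples in time, the result is effectively a smoothed version of the input. As you can see from the link I posted above, you can adjust certain parameters (such as n_neighbors) to get a 'cleaner' signal, but this can also degrade the original signal.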
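
The trade-off can be sketched on synthetic data. Everything below (the sine wave, the noise level, the values of k) is an assumption chosen for illustration and not taken from the Kaggle notebook; the clean signal is known only because the data is generated here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a sine wave plus Gaussian noise.
t = np.arange(1000)
clean = np.sin(2 * np.pi * t / 250)
noisy = clean + rng.normal(scale=0.5, size=t.size)

X = t[:, np.newaxis]
mse = {}
for k in (5, 50, 500):
    model = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    model.fit(X, noisy)
    denoised = model.predict(X)
    # Error against the clean signal, which is only available for synthetic data.
    mse[k] = np.mean((denoised - clean) ** 2)

print(mse)
```

In this setup a small k leaves much of the noise in place, a moderate k removes most of it, and a very large k averages over whole periods of the sine and flattens the signal itself. On evenly spaced time samples, uniform-weight KNN behaves much like a moving average whose window length is k.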

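One thing to note is that, used the way the Kaggle author uses it, the fitted model only works for that one signal. Since the input is time, it cannot be used to predict future values: for any time beyond the training range, the k nearest neighbours are simply the last k training samples, so the prediction becomes a constant. For example, shifting the time vector forward: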

y_pred = clf.predict(df.index.values[:, np.newaxis] + 400000)
ax = pd.Series(df.iloc[:, 0]).plot(color='lightgray')
pd.Series(y_pred).plot(color='black', ax=ax, figsize=(12, 8))
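
The same limitation can be checked numerically without the Kaggle data. The sketch below uses a synthetic signal (an assumption for illustration) and queries times that all lie past the end of the training range.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the signal (an assumption, not the Kaggle data).
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 250) + rng.normal(scale=0.5, size=t.size)

k = 100
model = KNeighborsRegressor(n_neighbors=k, weights='uniform')
model.fit(t[:, np.newaxis], signal)

# Query times that all lie beyond the last training sample.
future_pred = model.predict((t + 2000)[:, np.newaxis])

# Every future query has the same k nearest neighbours: the last k training samples.
last_k_mean = signal[-k:].mean()
print(np.allclose(future_pred, last_k_mean))
```

So the "forecast" is just a flat line at the mean of the final k samples, which is why this approach is useful for smoothing an existing signal but not for extrapolating it.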