python signal-processing pearson-correlation

Python: how to find correlation between two values and remove noise?


I have two curves, A and B, that are highly correlated, as shown in the figure below, where C is the Pearson correlation coefficient between A and B.

The file containing the data can be downloaded here.

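import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import pearsonr

# Load the two signals from the CSV file
df = pd.read_csv('prova.csv')
A = df['A'].values
B = df['B'].values

# pearsonr returns (correlation coefficient, p-value); keep the coefficient
C = pearsonr(A, B)[0]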


fig, ax = plt.subplots(1, 2, figsize=(20, 5))

# Left panel: A and B against sample index, each on its own y-axis
ax1 = ax[0]
ax2 = ax1.twinx()
ax1.plot(A, 'g-')
ax2.plot(B, 'b-')
ax1.set_ylabel('A', color='g', fontsize=20)
ax2.set_ylabel('B', color='b', fontsize=20)

# Right panel: scatter of B against A, with the correlation in the legend
ax3 = ax[1]
txt = 'C = %.2f' % C
ax3.scatter(A, B, label=txt)
ax3.set_xlabel('A', color='g', fontsize=20)
ax3.set_ylabel('B', color='b', fontsize=20)
ax3.legend(fontsize=16)

The values of the green curve (A) should be 0, but the signal is affected by B. I would like to find the relation between A and B so that the two cancel out, but I am unsure how to proceed.

Data and correlation plot


Solution

  • Clearly, A and B predict each other quite well. We can exploit this to build a combination of A and B that stays close to 0. My method of choice is a least-squares fit, using scipy.optimize.least_squares.

    We want to minimize the residual A - x * B - c, in the least-squares sense, over the parameters x and c. This can be done using,

    import matplotlib.pyplot as plt
    import pandas as pd
    import scipy.optimize as opt
    
    
    df = pd.read_csv('prova.csv')
    
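    def fit(x):
        # Residual of the linear model: x[0] is the scale factor, x[1] the offset
        return df['A'] - x[0] * df['B'] - x[1]
    
    
    # Minimise the sum of squared residuals, starting from x = 0, c = 0
    result = opt.least_squares(fit, [0, 0])
    
    # Plot the residual at the fitted parameters
    fit(result.x).plot()
    plt.show()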
    

    This results in,

    Result

    The residual is many orders of magnitude closer to zero than the original A.
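
Because the model A ≈ x * B + c is linear in x and c, the same fit can also be obtained in closed form. The sketch below is one possible way to do that with np.polyfit. It is not the code from the answer above, and it runs on synthetic data, since prova.csv is not reproduced here. The values true_slope, true_offset, the noise level and the random-walk shape of B are all assumptions made up for illustration. It also checks the textbook link between the fitted slope and the Pearson correlation, slope = C * std(A) / std(B).

```python
import numpy as np

# Synthetic stand-in for prova.csv (assumption: the real data is not available here).
rng = np.random.default_rng(0)
B = np.cumsum(rng.normal(size=500))        # a slowly wandering disturbance signal
true_slope, true_offset = 0.8, 2.0         # hypothetical values, for illustration only
A = true_slope * B + true_offset + rng.normal(scale=0.05, size=500)

# Degree-1 polynomial fit: least-squares solution of A ≈ slope * B + offset.
# np.polyfit returns coefficients with the highest power first.
slope, offset = np.polyfit(B, A, 1)

# The same slope follows from the Pearson correlation C.
C = np.corrcoef(A, B)[0, 1]
slope_from_C = C * A.std() / B.std()

# Subtract the fitted contribution of B from A.
residual = A - (slope * B + offset)

print(f"slope={slope:.3f}  offset={offset:.3f}  slope_from_C={slope_from_C:.3f}")
print(f"std(A)={A.std():.3f}  std(residual)={residual.std():.3f}")
print(f"corr(residual, B)={np.corrcoef(residual, B)[0, 1]:.2e}")
```

On real data, A and B would come from df['A'].values and df['B'].values instead. A residual whose correlation with B is close to zero suggests that the linear part of B's influence has been removed. Any nonlinear dependence would remain and would need a richer model.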