python pandas signal-processing correlation cross-correlation

Why is this correlation coefficient given by pandas unrealistically low?


I am coding in Python and correlating a row of a pandas DataFrame (index 2500) with a sinusoidal function that I defined (sine_modulation). I compute the value with

row_correlation(saved_data_DAQ.iloc[2500].values, sine_modulation(time_measurement,modulation_frequency_axion))

where row_correlation(f, g) is simply defined as np.corrcoef(f, g)[0, 1], and I obtain 0.23. However, if I plot both functions I can visually see an extremely high degree of correlation (see image). This is expected, because the blue curve is just random white noise (from a Gaussian distribution) plus a constant times the sine modulation itself (blue = noise + C*red, where C = 0.002).

I would like to know why the correlation computed by this function is so low, but more importantly, do you have any idea or suggestion on how to compute a correlation that better reflects the high degree of correlation between my two functions?

Visual inspection of both functions (row 2500 of dataframe and sine modulation)

A zoomed-in view is shown below

Zoom in showing also the cadence of the data points

NOTE: it may well be that the correlation is right and really is 0.23. In that case my question becomes: what other quantity could I compute to show whether my noise has an oscillating component or not? I saw the word "synchronization" in the comments; maybe this is the right quantity to compute?


Solution

  • I put together a short example to show where the low R values come from. Let's consider a purely positive sine:

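    import numpy as np

    N = 2500 # number of samples
    t = np.linspace(0, 1, N) # time axis from 0 to 1 (seconds, presumably)
    Fs = (N - 1)/t[-1] # sampling rate: linspace includes both endpoints, so there are N-1 intervals
    sine = (np.sin(4*np.pi*t - np.pi/2) + 1)/2 # positive sine wave, values in [0, 1]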
    

    Since you did not add your code, I assumed your noise looks something like this:

    noise = np.abs(np.random.normal(0, 0.1, len(t))) # rectified (non-negative) Gaussian noise, sigma = 0.1
    

    Finally, let's define the coefficient you are multiplying the sine wave with. Let's set it as going from 0.001 to 1 in a linear space with 100 samples:

    C = np.linspace(0.001, 1, 100) # pure sine coefficient
    

    If we loop through those values and we generate the noisy signal with sineWithNoise = c*sine + noise, we get the following results:

    oGIF

    To know the actual value of c, look at the xlabel of the third subplot (the rightmost axes).

    Most importantly, I think you need to look at the scatter plot, because the correlation coefficient is computed by comparing the two signals against each other, sample by sample (source for image):

    pearson

    and not by comparing how the two signals evolve in time (source for image):

    correlate
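
As a rough check of the scatter-plot argument, here is a minimal sketch, assuming the noise is independent of the sine and that both are zero-mean (a simplification of the positive signals used in this answer). Under those assumptions the Pearson coefficient of x = noise + c*s and s is c*sigma_s / sqrt(c^2*sigma_s^2 + sigma_noise^2), so it is governed by the signal-to-noise ratio rather than by how similar the curves look. The names `predicted_r`, `sigma_noise` and `results`, as well as the seed and the values of c, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

N = 2500
t = np.linspace(0, 1, N)
sine = np.sin(4 * np.pi * t)            # zero-mean reference signal (illustrative)
sigma_noise = 0.1                       # assumed noise standard deviation
noise = rng.normal(0, sigma_noise, N)   # zero-mean Gaussian noise


def predicted_r(c, sigma_signal, sigma_noise):
    """Pearson r expected for x = noise + c*s when noise and s are independent."""
    return c * sigma_signal / np.sqrt((c * sigma_signal) ** 2 + sigma_noise ** 2)


results = []  # (c, measured r, predicted r)
for c in (0.01, 0.03, 0.1, 1.0):
    noisy = c * sine + noise
    measured = np.corrcoef(noisy, sine)[0, 1]
    expected = predicted_r(c, sine.std(), sigma_noise)
    results.append((c, measured, expected))
    print(f"c={c:<5} measured r={measured:.3f}  predicted r={expected:.3f}")
```

A small r therefore does not by itself mean the sine is absent: for N independent samples, chance correlations are only of order 1/sqrt(N), so the measured value should be judged against that scale.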

    To use cross-correlation, you can use:

    import matplotlib.pyplot as plt
    from scipy.signal import correlate, correlation_lags

    sineWithNoise = 0.2*sine + noise # noisy signal for c = 0.2
    xcorr = correlate(sine, sineWithNoise) # full cross-correlation
    lags = correlation_lags(N, N)/Fs # lags converted from samples to seconds
    plt.figure()
    plt.plot(lags, xcorr)
    plt.grid()
    plt.xlabel("Lags (~s)") # xlabel
    plt.ylabel("Cross-correlation") # ylabel
    plt.axvline(0) # zero lag: where the peak sits if there is no shift
    plt.show()
    

    This gives the following result:

    crosscorr

    To see how well synchronised they are, check whether the maximum really occurs at zero lag:

    idxMax = np.argmax(xcorr) # get arg of maximum
    print(lags[idxMax]) # print corresponding lag
    # 0.0008, almost zero
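
One possible refinement, sketched here under assumptions rather than taken from the original post: because both signals above are strictly positive, the raw cross-correlation is dominated by their overlap. Subtracting the means and dividing by N times the two standard deviations gives a normalised cross-correlation whose zero-lag value equals the Pearson coefficient, which ties the two quantities discussed in this answer together. The seed, `dt`, and the variable names below are illustrative choices.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

N = 2500
t = np.linspace(0, 1, N)
dt = t[1] - t[0]                                  # sample spacing in seconds
sine = (np.sin(4 * np.pi * t - np.pi / 2) + 1) / 2
noise = np.abs(rng.normal(0, 0.1, N))             # same kind of noise as assumed above
noisy = 0.2 * sine + noise

# Remove the means so the constant offsets do not dominate the result
a = sine - sine.mean()
b = noisy - noisy.mean()

# Normalise so that the zero-lag value equals the Pearson coefficient
norm_xcorr = correlate(a, b, mode="full") / (N * a.std() * b.std())
lags = correlation_lags(N, N, mode="full")        # lags in samples

peak_lag_s = lags[np.argmax(norm_xcorr)] * dt     # lag of the maximum, in seconds
zero_lag_value = norm_xcorr[lags == 0][0]
pearson_r = np.corrcoef(sine, noisy)[0, 1]

print("peak lag (s):", peak_lag_s)
print("zero-lag value:", zero_lag_value, "Pearson r:", pearson_r)
```

With this normalisation the curve is bounded by 1 in absolute value, so the peak height and its lag can be read off directly.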
    

    Hope this helps you