Search code examples
pythonpandascorrelation

Correlations between two series in Pandas


I'm trying to find the correlations between two series of different length with Pandas, but it' not giving me the expected answer.

Here's what I've attempted. I have two dataframes A and B with columns 'time' and 'values'. 'time' are dates that end on '2023-08-18', but start on different days in A and B (specifically, '2020-05-11' and '2017-10-24', respectively). I created

A_series = A.loc[:,'time']

B_series = B.loc[:,'time']

and tried to use

A_series.corr(B_series, method = 'pearson')

but didn't get the right answer. By plotting and using =CORREL in Excel, I know the correlation should be around 0.97, but Pandas gave me 0.045.

Since the issue was likely with the mismatch in the two time periods, I tried sorting A and B by the column 'time' (descending), then computed the correlation using Series.corr again, but got the same answer.

For my next attempt I will probably try to create a new dataframe with columns 'time', 'values_A', 'values_B', with 'time' including only the shorter period in A and B. But I'm curious why the above happened. Apologize for the rather simple question. I've just started learning Pandas this morning.


Solution

  • import pandas as pd
    import numpy as np
    
    # Here I created a date range for A from '2020-05-11' to '2023-08-18'
    dates_A = pd.date_range(start='2020-05-11', end='2023-08-18')
    values_A = np.random.rand(len(dates_A))
    A = pd.DataFrame({
        'time': dates_A,
        'values': values_A
    })
    
    # And here I did the same for B with different range
    dates_B = pd.date_range(start='2017-10-24', end='2023-08-18')
    values_B = np.random.rand(len(dates_B))
    B = pd.DataFrame({
        'time': dates_B,
        'values': values_B
    })
    
    # I made the column 'time' as index
    A.set_index('time', inplace=True)
    B.set_index('time', inplace=True)
    
    # Merged both A and B on the 'time' index
    merged_df = A.merge(B, left_index=True, right_index=True, suffixes=('_A', '_B'))
    
    # and finally I calculated the correlation between 'values_A' and 'values_B'
    correlation = merged_df['values_A'].corr(merged_df['values_B'], method='pearson')
    
    print("Correlation:", correlation)