I'm trying to find the correlations between two series of different length with Pandas, but it' not giving me the expected answer.
Here's what I've attempted. I have two dataframes A and B with columns 'time' and 'values'. 'time' are dates that end on '2023-08-18', but start on different days in A and B (specifically, '2020-05-11' and '2017-10-24', respectively). I created
A_series = A.loc[:,'time']
B_series = B.loc[:,'time']
and tried to use
A_series.corr(B_series, method = 'pearson')
but didn't get the right answer. By plotting and using =CORREL in Excel, I know the correlation should be around 0.97, but Pandas gave me 0.045.
Since the issue was likely with the mismatch in the two time periods, I tried sorting A and B by the column 'time' (descending), then computed the correlation using Series.corr again, but got the same answer.
For my next attempt I will probably try to create a new dataframe with columns 'time', 'values_A', 'values_B', with 'time' including only the shorter period in A and B. But I'm curious why the above happened. Apologize for the rather simple question. I've just started learning Pandas this morning.
import pandas as pd
import numpy as np
# Here I created a date range for A from '2020-05-11' to '2023-08-18'
dates_A = pd.date_range(start='2020-05-11', end='2023-08-18')
values_A = np.random.rand(len(dates_A))
A = pd.DataFrame({
'time': dates_A,
'values': values_A
})
# And here I did the same for B with different range
dates_B = pd.date_range(start='2017-10-24', end='2023-08-18')
values_B = np.random.rand(len(dates_B))
B = pd.DataFrame({
'time': dates_B,
'values': values_B
})
# I made the column 'time' as index
A.set_index('time', inplace=True)
B.set_index('time', inplace=True)
# Merged both A and B on the 'time' index
merged_df = A.merge(B, left_index=True, right_index=True, suffixes=('_A', '_B'))
# and finally I calculated the correlation between 'values_A' and 'values_B'
correlation = merged_df['values_A'].corr(merged_df['values_B'], method='pearson')
print("Correlation:", correlation)