Tags: python, pandas, statistics

Calculating the correlation coefficient of time series data of unequal length


Suppose you have a dataframe like this:

import pandas as pd

data = {'site': ['A', 'A', 'B', 'B', 'C', 'C'],
        'item': ['x', 'x', 'x', 'x', 'x', 'x'],
        'date': ['2023-03-01', '2023-03-10', '2023-03-20', '2023-03-27', '2023-03-5', '2023-03-12'],
        'quantity': [10, 20, 30, 20, 30, 50]}
df_sample = pd.DataFrame(data=data)
df_sample.head()

Here you have different sites and items, each with a date and a quantity. What you want to do is calculate the correlation between, say, site A and site B for item x, based on their associated quantities. However, the two series could be of different lengths in the dataframe. How would you go about doing this?

The actual data under consideration can be found here.

What I tried was setting up two separate dataframes like this:

df1 = df_sample[(df_sample['site'] == 'A') & (df_sample['item'] == 'x')]
df2 = df_sample[(df_sample['site'] == 'B') & (df_sample['item'] == 'x')]

and then forcing them to have the same size and calculating the correlation coefficient from there, but I am sure there is a better way to do this.


Solution

  • Reshape to wide form with pivot_table and fill the missing data points with zeros, so that every site has a value for every date and the columns can be compared row by row. You can then select the item you want and compute the correlation of all combinations of columns with corr:

    tmp = df_sample.pivot_table(index='date', columns=['item', 'site'],
                                values='quantity', fill_value=0)
    
    out = tmp['x'].corr()
    

    Output:

    site         A         B         C
    site                              
    A     1.000000 -0.449618 -0.442627
    B    -0.449618  1.000000 -0.464363
    C    -0.442627 -0.464363  1.000000
    

    Intermediate tmp:

    item           x            
    site           A     B     C
    date                        
    2023-03-01  10.0   0.0   0.0
    2023-03-10  20.0   0.0   0.0
    2023-03-12   0.0   0.0  50.0
    2023-03-20   0.0  30.0   0.0
    2023-03-27   0.0  20.0   0.0
    2023-03-5    0.0   0.0  30.0
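
As a minimal sketch of how the answer's approach might be used for a single pair of sites, the snippet below rebuilds the sample data, pivots it as above, and reads the A/B coefficient out of the correlation matrix. Two things here are assumptions rather than part of the answer: the dates are parsed with pd.to_datetime (treating '2023-03-5' as 5 March) so that rows sort chronologically, and the names pair_ab and sparse are purely illustrative. It also shows what happens when fill_value is left out, since corr then only uses dates on which both sites have a value.

```python
import pandas as pd

data = {'site': ['A', 'A', 'B', 'B', 'C', 'C'],
        'item': ['x', 'x', 'x', 'x', 'x', 'x'],
        'date': ['2023-03-01', '2023-03-10', '2023-03-20',
                 '2023-03-27', '2023-03-5', '2023-03-12'],
        'quantity': [10, 20, 30, 20, 30, 50]}
df_sample = pd.DataFrame(data=data)

# Assumption: '2023-03-5' means 5 March; parsing makes the index sort by date
# instead of by string, which does not change the correlation values.
df_sample['date'] = pd.to_datetime(df_sample['date'])

# Same reshaping as in the answer: one column per (item, site), zeros for gaps.
tmp = df_sample.pivot_table(index='date', columns=['item', 'site'],
                            values='quantity', fill_value=0)
out = tmp['x'].corr()

# Correlation between site A and site B for item x (illustrative name).
pair_ab = out.loc['A', 'B']
print(round(pair_ab, 6))

# Without fill_value the gaps stay NaN and corr() drops them pairwise.
# In this sample A and B never share a date, so their coefficient is NaN.
sparse = df_sample.pivot_table(index='date', columns=['item', 'site'],
                               values='quantity')['x'].corr()
print(sparse.loc['A', 'B'])
```

Whether zero is the right fill depends on what a missing date means in the real data: it treats "no record" as "quantity 0", which is what drives the negative coefficients in this small sample.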