Search code examples
pythonpandasdataframecorrelationscalar

How to correlate scalar values of two pandas dataframes


How do I correlate two pandas dataframes, find a single r value for all values? I don't want to correlate columns or rows, but all scalar values. One dataframe is the x axis, and the other dataframe is the y axis.

I downloaded identically structured csv files here: https://www.gapminder.org/data/ The tables have years for columns, countries for rows, with numerical values for the indicator that each table reports.

For instance, I want to see how the Political Participation Indicator (gapminder calls it an index, but I don't want to confuse it with a dataframe index) correlates overall with the Government Functioning Indicator, by year and country.

pol_partix_idx_EIU_df = pd.read_csv('polpartix_eiu.csv',index_col=0)
govt_idx_EIU_df = pd.read_csv('gvtx_eiu.csv',index_col=0)

pol_partix_idx_EIU_df.head()

    2006    2007    2008    2009    2010    2011    2012    2013    2014    2015    2016    2017    2018
country                                                 
Afghanistan 0.222   0.222   0.222   0.250   0.278   0.278   0.278   0.278   0.389   0.389   0.278   0.278   0.444
Albania 0.444   0.444   0.444   0.444   0.444   0.500   0.500   0.500   0.500   0.556   0.556   0.556   0.556
Algeria 0.222   0.194   0.167   0.223   0.278   0.278   0.389   0.389   0.389   0.389   0.389   0.389   0.389
Angola  0.111   0.250   0.389   0.416   0.444   0.444   0.500   0.500   0.500   0.500   0.556   0.556   0.556
Argentina   0.556   0.556   0.556   0.556   0.556   0.556   0.556   0.556   0.556   0.611   0.611   0.611   0.611

You can correlate by column or row:

pol_partix_idx_EIU_df.corrwith(govt_idx_EIU_df, axis=0)

2006    0.738297

2007    0.745321

2008    0.731913

...

2018    0.718520

dtype: float64


pol_partix_idx_EIU_df.corrwith(govt_idx_EIU_df, axis=1)

country

Afghanistan    6.790123e-01

Albania       -5.664265e-01

...

Zimbabwe       4.456537e-01

Length: 164, dtype: float64

But, I want a single r value that compares every field in one table with every corresponding field in the other table. Essentially, I want the r value of this scatterplot:

plt.scatter(pol_cultx_idx_EIU_df,govt_idx_EIU_df)
plt.xlabel('Political participation')
plt.ylabel('Government functioning')

(The example code won't color the plot like this, but plots the same points.)

The example code won't color the plot like this, but plots the same points.

The second part of the question would be how to do this with tables that aren't exactly identical in structure. Every table (dataframe) I want to compare has country records and year columns, but not all of them have the same countries or years. In the example above, they do. How do I get a single r value for only the shared rows and columns of the dataframes?


Solution

  • I've simulated a setup that I think mimics yours--three dataframes with countries across rows and years across columns. I then concatenate the different sets of data into a single dataframe. And show how to compute the correlation between them. Let me know if something about this example doesn't match your setup.

    import pandas as pd
    
    set1 = pd.DataFrame({1980:[4, 11, 0], 1981:[5, 10, 2], 1982:[0, 3, 1]},
        index=pd.Index(['USA', 'UK', 'Iran'], name='country'))
    set1.columns.name = 'year'
    set1
    
    year     1980  1981  1982
    country                  
    USA         4     5     0
    UK         11    10     3
    Iran        0     2     1
    
    set2 = pd.DataFrame({1981:[2, 1, 10], 1982:[15, 1, 12], 1983:[10, 13, 1]},
        index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
    set2.columns.name = 'year'
    set2
    
    year     1981  1982  1983
    country                  
    USA         2    15    10
    UK          1     1    13
    Turkey     10    12     1
    

    Notice that, like your setup, some countries/years are not present in different datasets.

    set3 = pd.DataFrame({1980:[12, 11, 4], 1982:[9, 8, 11]},
        index=pd.Index(['USA', 'UK', 'Turkey'], name='country'))
    set3.columns.name = 'year'
    

    We can turns these into multi-indexed series by stacking along year and then concatenate these across columns using pd.concat.

    df = pd.concat([set1.stack('year'), set2.stack('year'), set3.stack('year')],
        keys=['set1', 'set2', 'set3'], names=['set'], axis=1)
    df
    
    set           set1  set2  set3
    country year                  
    Iran    1980   0.0   NaN   NaN
            1981   2.0   NaN   NaN
            1982   1.0   NaN   NaN
    Turkey  1980   NaN   NaN   4.0
            1981   NaN  10.0   NaN
            1982   NaN  12.0  11.0
            1983   NaN   1.0   NaN
    UK      1980  11.0   NaN  11.0
            1981  10.0   1.0   NaN
            1982   3.0   1.0   8.0
            1983   NaN  13.0   NaN
    USA     1980   4.0   NaN  12.0
            1981   5.0   2.0   NaN
            1982   0.0  15.0   9.0
            1983   NaN  10.0   NaN
    

    And we can compute a 3x3 correlation matrix across the three different sets.

    df.corr()
    
    set       set1      set2      set3
    set                               
    set1  1.000000 -0.723632  0.509902
    set2 -0.723632  1.000000  0.606891
    set3  0.509902  0.606891  1.000000