Search code examples
pythonpandasnumpycorrelationcovariance

Create covariance matrix on columns that are not full


I'm looking to create a variance/covariance matrix or a correlation matrix (I don't care which) on stock data.

The issue is that different symbols trade at different times, so my dataframe is not full.

How would I generate a covariance matrix on a dataset that has a timestamp column, and where some entries are null.

Here would be an example.

            timestamp   usdjpy  AAPL  NIKKEI
0 2021-01-01 00:00:00        4   NaN     NaN
1 2021-01-01 00:05:00        5     5     NaN
2 2021-01-01 00:10:00        6     6       9
3 2021-01-01 00:15:00        4     2       4
4 2021-01-01 00:20:00        2   NaN       3
5 2021-01-01 00:25:00        7   NaN       7

What I would want here is the covariance between usdjpy and aapl to be computed on their overlapping timestamps.


Solution

  • Just use cov, this will ignore the NaNs by default:

    out = df.drop(columns=['timestamp']).cov()
    

    Output:

              usdjpy       AAPL     NIKKEI
    usdjpy  3.066667   2.000000   5.250000
    AAPL    2.000000   4.333333  10.000000
    NIKKEI  5.250000  10.000000   7.583333
    

    If you just want to compute the covariance between the tickers and usdjpy:

    out = df[['AAPL', 'NIKKEI']].apply(lambda s: s.cov(df['usdjpy']))
    
    # equivalent to
    # df[['AAPL', 'NIKKEI']].apply(lambda s: s.dropna().cov(df['usdjpy']))
    

    Output:

    AAPL      2.00
    NIKKEI    5.25
    dtype: float64
    

    For the correlation, use corr:

    df.drop(columns=['timestamp']).corr()
    
              usdjpy      AAPL    NIKKEI
    usdjpy  1.000000  0.960769  0.859793
    AAPL    0.960769  1.000000  1.000000
    NIKKEI  0.859793  1.000000  1.000000
    
    
    df[['AAPL', 'NIKKEI']].corrwith(df['usdjpy'], axis=0)
    
    AAPL      0.960769
    NIKKEI    0.859793
    dtype: float64
    

    Excerpt of the cov documentation on NaNs:

    Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

    Notes

    For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

    However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.