Search code examples
pythondataframepcamulti-index

Apply PCA on Multiindex for several dates


I am trying to perform PCA on a multiindex which gives a correlation matrix on several days. For each of those days I would like to perform PCA on the correlation matrices. Any help is appreciated.

The DataFrame: rolling_cor_monthly (6140 rows × 10 columns):

                     NoDur   Durbl   Manuf   Enrgy   HiTec   Telcm   Shops   Hlth    Utils   Other
Date        level_1                                     
2021-01-31  NoDur    1.00000 0.62369 0.87367 0.65322 0.74356 0.84011 0.77417 0.80183 0.82833 0.84094
            Durbl    0.62369 1.00000 0.69965 0.57501 0.70125 0.60104 0.68652 0.61333 0.45301 0.70556
            Manuf    0.87367 0.69965 1.00000 0.78599 0.81415 0.84477 0.80932 0.82127 0.74803 0.94673
            Enrgy    0.65322 0.57501 0.78599 1.00000 0.59940 0.67492 0.58058 0.61946 0.57830 0.81593
            HiTec    0.74356 0.70125 0.81415 0.59940 1.00000 0.75436 0.91318 0.84508 0.59302 0.81109
            Telcm    0.84011 0.60104 0.84477 0.67492 0.75436 1.00000 0.77555 0.77342 0.73186 0.85595
            Shops    0.77417 0.68652 0.80932 0.58058 0.91318 0.77555 1.00000 0.81197 0.61574 0.79932
            Hlth     0.80183 0.61333 0.82127 0.61946 0.84508 0.77342 0.81197 1.00000 0.70032 0.80875
            Utils    0.82833 0.45301 0.74803 0.57830 0.59302 0.73186 0.61574 0.70032 1.00000 0.72739
            Other    0.84094 0.70556 0.94673 0.81593 0.81109 0.85595 0.79932 0.80875 0.72739 1.00000
2021-02-28  NoDur    1.00000 0.61544 0.87041 0.64622 0.73941 0.83792 0.77075 0.79993 0.82813 0.83937
            Durbl    0.61544 1.00000 0.69464 0.55865 0.70203 0.59109 0.68265 0.60963 0.44792 0.69685 
            Manuf    0.87041 0.69464 1.00000 0.78243 0.81121 0.84189 0.80395 0.81809 0.74489 0.94605
            Enrgy    0.64622 0.55865 0.78243 1.00000 0.58911 0.67134 0.56925 0.61252 0.56865 0.81365
            HiTec    0.73941 0.70203 0.81121 0.58911 1.00000 0.74904 0.91274 0.84179 0.58973 0.80581
            Telcm    0.83792 0.59109 0.84189 0.67134 0.74904 1.00000 0.77078 0.76844 0.72814 0.85493
            Shops    0.77075 0.68265 0.80395 0.56925 0.91274 0.77078 1.00000 0.80924 0.61446 0.79342
            Hlth     0.79993 0.60963 0.81809 0.61252 0.84179 0.76844 0.80924 1.00000 0.69965 0.80394
            Utils    0.82813 0.44792 0.74489 0.56865 0.58973 0.72814 0.61446 0.69965 1.00000 0.72542
            Other    0.83937 0.69685 0.94605 0.81365 0.80581 0.85493 0.79342 0.80394 0.72542 1.00000

The code I tried:

eigenvalues, eigenvectors = LA.eig(rolling_cor_monthly)
idx = eigenvalues.argsort()[::-1]   
D = pd.DataFrame(data = np.diag(eigenvalues[idx]))
P = pd.DataFrame(data = eigenvectors[:,idx])

The error:

 LinAlgError: Last 2 dimensions of the array must be square

The output I hope to obtain is of the same format as the dataframe.

Many thanks in advance!


Solution

  • This requires working with the additional dimension, so it will be a bit more involved:

    import numpy as np
    import numpy.linalg as LA
    import pandas as pd
    
    # convert dataframe to a 3-d array (the new axis will correspond to date index)
    arr = df.values[np.newaxis,:,:].reshape((len(df.index.levels[0]),10,10))
    
    # get eigenvalues (n x 10) and eigenvectors (n x 10 x 10)
    eigenvalues, eigenvectors = LA.eig(arr)
    

    The rest of your code (sorting and converting to dataframes) can be written as:

    eigenvalues = np.sort(eigenvalues, axis=1)[:, ::-1]
    # can also use this to sort:
    # idx = eigenvalues.argsort()[:, ::-1]
    # eigenvalues = np.take_along_axis(eigenvalues, idx, axis=1))
    
    D = pd.DataFrame(
        np.apply_along_axis(np.diag, 1, eigenvalues).reshape(-1,10),
        index=df.index
    )
    
    eigenvectors = np.sort(eigenvectors, axis=1)[:, ::-1]
    P = pd.DataFrame(
        eigenvectors.reshape(-1,10),
        index=df.index
    )