I am trying to perform PCA on a multiindex which gives a correlation matrix on several days. For each of those days I would like to perform PCA on the correlation matrices. Any help is appreciated.
The DataFrame: rolling_cor_monthly (6140 rows × 10 columns):
NoDur Durbl Manuf Enrgy HiTec Telcm Shops Hlth Utils Other
Date level_1
2021-01-31 NoDur 1.00000 0.62369 0.87367 0.65322 0.74356 0.84011 0.77417 0.80183 0.82833 0.84094
Durbl 0.62369 1.00000 0.69965 0.57501 0.70125 0.60104 0.68652 0.61333 0.45301 0.70556
Manuf 0.87367 0.69965 1.00000 0.78599 0.81415 0.84477 0.80932 0.82127 0.74803 0.94673
Enrgy 0.65322 0.57501 0.78599 1.00000 0.59940 0.67492 0.58058 0.61946 0.57830 0.81593
HiTec 0.74356 0.70125 0.81415 0.59940 1.00000 0.75436 0.91318 0.84508 0.59302 0.81109
Telcm 0.84011 0.60104 0.84477 0.67492 0.75436 1.00000 0.77555 0.77342 0.73186 0.85595
Shops 0.77417 0.68652 0.80932 0.58058 0.91318 0.77555 1.00000 0.81197 0.61574 0.79932
Hlth 0.80183 0.61333 0.82127 0.61946 0.84508 0.77342 0.81197 1.00000 0.70032 0.80875
Utils 0.82833 0.45301 0.74803 0.57830 0.59302 0.73186 0.61574 0.70032 1.00000 0.72739
Other 0.84094 0.70556 0.94673 0.81593 0.81109 0.85595 0.79932 0.80875 0.72739 1.00000
2021-02-28 NoDur 1.00000 0.61544 0.87041 0.64622 0.73941 0.83792 0.77075 0.79993 0.82813 0.83937
Durbl 0.61544 1.00000 0.69464 0.55865 0.70203 0.59109 0.68265 0.60963 0.44792 0.69685
Manuf 0.87041 0.69464 1.00000 0.78243 0.81121 0.84189 0.80395 0.81809 0.74489 0.94605
Enrgy 0.64622 0.55865 0.78243 1.00000 0.58911 0.67134 0.56925 0.61252 0.56865 0.81365
HiTec 0.73941 0.70203 0.81121 0.58911 1.00000 0.74904 0.91274 0.84179 0.58973 0.80581
Telcm 0.83792 0.59109 0.84189 0.67134 0.74904 1.00000 0.77078 0.76844 0.72814 0.85493
Shops 0.77075 0.68265 0.80395 0.56925 0.91274 0.77078 1.00000 0.80924 0.61446 0.79342
Hlth 0.79993 0.60963 0.81809 0.61252 0.84179 0.76844 0.80924 1.00000 0.69965 0.80394
Utils 0.82813 0.44792 0.74489 0.56865 0.58973 0.72814 0.61446 0.69965 1.00000 0.72542
Other 0.83937 0.69685 0.94605 0.81365 0.80581 0.85493 0.79342 0.80394 0.72542 1.00000
The code I tried:
eigenvalues, eigenvectors = LA.eig(rolling_cor_monthly)
idx = eigenvalues.argsort()[::-1]
D = pd.DataFrame(data = np.diag(eigenvalues[idx]))
P = pd.DataFrame(data = eigenvectors[:,idx])
The error:
LinAlgError: Last 2 dimensions of the array must be square
The output I hope to obtain is of the same format as the dataframe.
Many thanks in advance!
This requires working with the additional dimension, so it will be a bit more involved:
import numpy as np
import numpy.linalg as LA
import pandas as pd
# convert dataframe to a 3-d array (the new axis will correspond to date index)
arr = df.values[np.newaxis,:,:].reshape((len(df.index.levels[0]),10,10))
# get eigenvalues (n x 10) and eigenvectors (n x 10 x 10)
eigenvalues, eigenvectors = LA.eig(arr)
The rest of your code (sorting and converting to dataframes) can be written as:
eigenvalues = np.sort(eigenvalues, axis=1)[:, ::-1]
# can also use this to sort:
# idx = eigenvalues.argsort()[:, ::-1]
# eigenvalues = np.take_along_axis(eigenvalues, idx, axis=1))
D = pd.DataFrame(
np.apply_along_axis(np.diag, 1, eigenvalues).reshape(-1,10),
index=df.index
)
eigenvectors = np.sort(eigenvectors, axis=1)[:, ::-1]
P = pd.DataFrame(
eigenvectors.reshape(-1,10),
index=df.index
)