For the sake of simplicity and reproducibility, here is the code for generating a sample dataset:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
data = pd.DataFrame({
'year':['2001', '2002', '2003', '2004', '2005', '2001', '2002', '2003', '2004', '2005', '2006', '2001', '2002', '2003', '2004', '2005', '2006', '2001', '2002', '2003', '2004', '2005', '2006'],
'ID':[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'factor':[np.nan, 0.45, .4, .2, -0.3, np.nan, .11, .21, .4, .01, np.nan, -0.32, 0.93, 0.66, np.nan, 0.5, np.nan, -0.12, -0.14, 0.36, 0.3, 0.21, np.nan],
'return':[.11, 0.45, .34, .52, -0.93, 1.54, 1.01, .31, np.nan, -0.01, -0.2, -0.32, 1.94, 0.66, 1.34, 1.5, 0.1, np.nan, -0.14, 0.36, 0.3, 0.2, 0.9],
'size': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
'age': [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 11, 0, 2, 3, 15, 16, 17, 0, 1, 1, 2, 22, 23]})
data = data.set_index(['ID', 'year'])
data = data.fillna(0)
X = data.drop('return', axis = 1)
pca = PCA(n_components = 1)
PCA_PCs = pca.fit(X).transform(X)
When I apply PCA to the data, I expected to get a single principal component for every year, but instead I get:
print(PCA_PCs)
[[-11.16171282]
[-10.43225691]
[ -9.7023714 ]
[ -8.972357 ]
[ -7.467887 ]
[ -7.99143666]
[ -6.48749078]
[ -5.75773415]
[ -5.02805484]
[ -4.2978772 ]
[ 4.17395234]
[ -4.20443643]
[ -1.92727222]
[ -0.42299984]
[ 9.59778387]
[ 11.10139466]
[ 12.60586466]
[ -0.417883 ]
[ 1.08617458]
[ 1.81558753]
[ 3.31967947]
[ 19.53355614]
[ 21.03777697]]
This has one value per row of the DataFrame rather than one per year. Is there a way that I can summarize or calculate a weighted principal component for each year?
You can assign the first principal component as a new column of the original DataFrame, then group by year and average:
data.assign(pc1=PCA_PCs[:, 0]).groupby('year')['pc1'].mean()
# year
# 2001 -5.726232
# 2002 -4.321416
# 2003 -3.502976
# 2004 -0.337490
# 2005 4.588059
# 2006 12.400073
# Name: pc1, dtype: float64
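Since you asked about a *weighted* principal component: if each observation has a weight, you can use np.average inside the groupby. A minimal sketch, with a small stand-in frame and using size as the weight column (that choice of weight is purely an assumption here):

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data: one pc1 value and a
# hypothetical weight ('size') per observation.
df = pd.DataFrame({
    'year': ['2001', '2001', '2002', '2002'],
    'pc1':  [-11.16, -7.99, -10.43, -6.49],
    'size': [1, 6, 2, 7],
})

# Weighted mean of pc1 per year, weighted by 'size'
weighted = df.groupby('year').apply(
    lambda g: np.average(g['pc1'], weights=g['size']))
print(weighted)
```

With your actual frame you would assign pc1 as above and pick whichever column genuinely represents the weight.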
This is a valid way to summarize the data across years, in the same sense as any average over a vector space: it works if you trust and understand your principal components.
The first principal component will often just measure the average value, so you may just be showing that 2006 was larger than other years. Taking more components might help.
PCA_PCs = PCA(n_components=2).fit_transform(X)
pca_df = pd.DataFrame(PCA_PCs, columns=['pc1', 'pc2'])
years = data.reset_index()[['year']]
year_pca_df = pd.concat([years, pca_df], axis=1)
year_pca_df.groupby('year').mean()
# pc1 pc2
# year
# 2001 -5.726232 1.043865
# 2002 -4.321416 1.204462
# 2003 -3.502976 1.831200
# 2004 -0.337490 0.588444
# 2005 4.588059 -2.055726
# 2006 12.400073 -3.482995
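One caveat with this particular data: size and age are on much larger scales than factor, so the unscaled components mostly track scale rather than structure. Whether to standardize first is a judgment call, but this sketch (again on toy data, not your frame) shows how much it changes the picture:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(23, 3))
X_demo[:, 1] *= 10  # mimic one feature (e.g. 'size') dwarfing the rest

# Unscaled: PC1 is dominated by the large-scale feature
raw = PCA().fit(X_demo).explained_variance_ratio_[0]

# Standardized: every feature contributes on an equal footing
X_std = StandardScaler().fit_transform(X_demo)
scaled = PCA().fit(X_std).explained_variance_ratio_[0]

print(raw, scaled)
```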