For the sake of simplicity and reproducibility, here is the code for generating a sample dataset:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
data = pd.DataFrame({
'year':['2001', '2002', '2003', '2004', '2005', '2001', '2002', '2003', '2004', '2005', '2006', '2001', '2002', '2003', '2004', '2005', '2006', '2001', '2002', '2003', '2004', '2005', '2006'],
'ID':[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'factor':[np.nan, 0.45, .4, .2, -0.3, np.nan, .11, .21, .4, .01, np.nan, -0.32, 0.93, 0.66, np.nan, 0.5, np.nan, -0.12, -0.14, 0.36, 0.3, 0.21, np.nan],
'return':[.11, 0.45, .34, .52, -0.93, 1.54, 1.01, .31, np.nan, -0.01, -0.2, -0.32, 1.94, 0.66, 1.34, 1.5, 0.1, np.nan, -0.14, 0.36, 0.3, 0.2, 0.9],
'size': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
'age': [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 11, 0, 2, 3, 15, 16, 17, 0, 1, 1, 2, 22, 23]})
data = data.set_index(['ID', 'year'])
data = data.fillna(0)
X = data.drop('return', axis = 1)
pca = PCA(n_components = 1)
PCA_PCs = pca.fit(X).transform(X)
When I apply PCA to the data, I expected to get a single principal component for every year, but instead I get:
print(PCA_PCs)
[[-11.16171282]
[-10.43225691]
[ -9.7023714 ]
[ -8.972357 ]
[ -7.467887 ]
[ -7.99143666]
[ -6.48749078]
[ -5.75773415]
[ -5.02805484]
[ -4.2978772 ]
[ 4.17395234]
[ -4.20443643]
[ -1.92727222]
[ -0.42299984]
[ 9.59778387]
[ 11.10139466]
[ 12.60586466]
[ -0.417883 ]
[ 1.08617458]
[ 1.81558753]
[ 3.31967947]
[ 19.53355614]
[ 21.03777697]]
This has one value per row of the DataFrame rather than one per year. Is there a way that I can summarize or calculate a weighted principal component for each year?
You can assign the first principal component as a new column of the original DataFrame, then group by year and average:
data.assign(pc1=PCA_PCs[:, 0]).groupby('year')['pc1'].mean()
# year
# 2001 -5.726232
# 2002 -4.321416
# 2003 -3.502976
# 2004 -0.337490
# 2005 4.588059
# 2006 12.400073
# Name: pc1, dtype: float64
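Since you asked about a *weighted* principal component: if each observation has a weight, you can use np.average inside the groupby. A minimal sketch, with a small stand-in frame and using size as the weight column (that choice of weight is purely an assumption here):

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data: one pc1 value and a
# hypothetical weight ('size') per observation.
df = pd.DataFrame({
    'year': ['2001', '2001', '2002', '2002'],
    'pc1':  [-11.16, -7.99, -10.43, -6.49],
    'size': [1, 6, 2, 7],
})

# Weighted mean of pc1 per year, weighted by 'size'
weighted = df.groupby('year').apply(
    lambda g: np.average(g['pc1'], weights=g['size']))
print(weighted)
```

With your actual frame you would assign pc1 as above and pick whichever column genuinely represents the weight.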
This is a valid way to summarize the data across years, in the same sense as any average over a vector space: it works if you trust and understand your principal components.
The first principal component will often just measure the average value, so you may just be showing that 2006 was larger than other years. Taking more components might help.
PCA_PCs = PCA(n_components=2).fit_transform(X)
pca_df = pd.DataFrame(PCA_PCs, columns=['pc1', 'pc2'])
years = data.reset_index()[['year']]
year_pca_df = pd.concat([years, pca_df], axis=1)
year_pca_df.groupby('year').mean()
# pc1 pc2
# year
# 2001 -5.726232 1.043865
# 2002 -4.321416 1.204462
# 2003 -3.502976 1.831200
# 2004 -0.337490 0.588444
# 2005 4.588059 -2.055726
# 2006 12.400073 -3.482995
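One caveat with this particular data: size and age are on much larger scales than factor, so the unscaled components mostly track scale rather than structure. Whether to standardize first is a judgment call, but this sketch (again on toy data, not your frame) shows how much it changes the picture:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(23, 3))
X_demo[:, 1] *= 10  # mimic one feature (e.g. 'size') dwarfing the rest

# Unscaled: PC1 is dominated by the large-scale feature
raw = PCA().fit(X_demo).explained_variance_ratio_[0]

# Standardized: every feature contributes on an equal footing
X_std = StandardScaler().fit_transform(X_demo)
scaled = PCA().fit(X_std).explained_variance_ratio_[0]

print(raw, scaled)
```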