I am working on a small project that needs to run PCA per group. Everything works, but I am looking for a way to improve my self-defined PCA function.
The function I use:
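import numpy as np
import pandas as pd
from scipy import stats

def pca(data):
    try:
        # Standardize each column, ignoring NaNs
        x = stats.zscore(data, nan_policy='omit')
        covar = np.cov(x, rowvar=False)
        eigval, eigvec = np.linalg.eig(covar)
    except Exception as e:
        # np.NaN was removed in NumPy 2.0; np.nan works in all versions
        return pd.Series([np.nan]*len(data))
    else:
        # Project the standardized data onto the first eigenvector column
        return x@eigvec[:, :1]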
I use this function to compute the first principal component for each group as follows:
sam.groupby('gvkey')[['xgat', 'xgsale', 'xcap']].apply(pca)
Everything works fine. The only small issue is that the output has three columns: the 1st is gvkey, the 2nd is empty, and the 3rd is 0.
What I want: improve my self-defined function so that the output has no 2nd index column. In general, the result should look like what groupby['col'].transform('mean') returns.
I do not want a workaround such as reset_index(), as in: sam.groupby('gvkey')[['xgat', 'xgsale', 'xcap']].apply(pca).reset_index(level=1, drop=True).
You can use group_keys=False to remove the group key:
import numpy as np
from scipy import stats
from scipy.linalg import LinAlgError

def pca(data):
    try:
        x = stats.zscore(data, nan_policy='omit')
        covar = np.cov(x, rowvar=False)
        eigval, eigvec = np.linalg.eig(covar)
    except LinAlgError:
        # Falls through and implicitly returns None for this group
        pass
    else:
        return x@eigvec[:, :1]
sam.groupby('gvkey', group_keys=False)[['xgat', 'xgsale', 'xcap']].apply(pca)