python pandas dataframe group-by

How can I improve my self-defined PCA function?


I am working on a small project that needs to run PCA per group. Everything works, but I am looking for a way to improve my self-defined PCA function.

The self-defined function I use:

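import numpy as np
import pandas as pd
from scipy import stats

def pca(data):
    try:
        # Standardize each column, ignoring NaNs
        x = stats.zscore(data, nan_policy='omit')
        covar = np.cov(x, rowvar=False)
        eigval, eigvec = np.linalg.eig(covar)
    except Exception as e:
        # On any failure, return one NaN per input row
        return pd.Series([np.nan]*len(data))
    else:
        # Project onto the first eigenvector column
        return x@eigvec[:, :1]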

I use this function to compute the first principal component for each group, as follows:

sam.groupby('gvkey')[['xgat', 'xgsale', 'xcap']].apply(pca)

Everything works fine. The only small issue is that the output has three columns: the first is gvkey, the second has no name, and the third is named 0.

What I want: to improve my self-defined function so that the output has no second index column. In general, the result should look like what groupby(...)['col'].transform('mean') returns.

I do not want a workaround such as reset_index(), as in: sam.groupby('gvkey')[['xgat', 'xgsale', 'xcap']].apply(pca).reset_index(level=1, drop=True).


Solution

  • You can pass group_keys=False to groupby so that the group key is not added to the result index:

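    import numpy as np
    from scipy import stats
    from scipy.linalg import LinAlgError

    def pca(data):
        try:
            # Standardize each column, ignoring NaNs
            x = stats.zscore(data, nan_policy='omit')
            covar = np.cov(x, rowvar=False)
            eigval, eigvec = np.linalg.eig(covar)
        except LinAlgError:
            # Decomposition failed: fall through and return None for this group
            pass
        else:
            # Project onto the first eigenvector column
            return x@eigvec[:, :1]

    # group_keys=False keeps the group key out of the result index
    sam.groupby('gvkey', group_keys=False)[['xgat', 'xgsale', 'xcap']].apply(pca)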