python pca sklearn-pandas

Need to perform principal component analysis on a dataframe collection in Python using NumPy or sklearn


I have a 'dataframe collection' df with the data below. I am trying to perform principal component analysis (PCA) on the dataframe collection using sklearn, but I am getting a TypeError.

from sklearn.decomposition import PCA
df  # dataframe collection
pca = PCA(n_components=5)
pca.fit(X)

How can I convert the dataframe collection to an array (matrix), keeping the sequence? I think that if I convert it into an array, I will be able to do PCA.

data:

{'USSP2 CMPN Curncy': 
 0       0.297453
 1       0.320505
 2       0.345978
 3       0.427871
 Name: (USSP2 CMPN Curncy, PX_LAST), Length: 1747, dtype: float64, 
 'MARGDEBT Index': 
 0     0.095478
 1     0.167469
 2     0.186317
 3     0.203729
 Name: (MARGDEBT Index, PX_LAST), Length: 79, dtype: float64, 
 'SL% SMT% Index': 
 0     0.163636
 1     0.000000
 2     0.000000
 3     0.363636
 Name: (SL% SMT% Index, PX_LAST), dtype: float64, 
 'FFSRAIWS Index': 
 0     0.157234
 1     0.278174
 2     0.530603
 3     0.526519
 Name: (FFSRAIWS Index, PX_LAST), dtype: float64, 
 'USPHNSA Index': 
 0     0.107330
 1     0.213351
 2     0.544503
 3     0.460733
 Name: (USPHNSA Index, PX_LAST), Length: 79, dtype: float64}

Can anyone help with PCA on a dataframe collection? Thanks!


Solution

  • Your dataframe collection is a dictionary (dict) of DataFrame objects.

    To perform the analysis you need a single array of data to work with, so the first step is to convert the collection into one DataFrame. pandas natively supports concatenating a dictionary of DataFrames, e.g.

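    import pandas as pd

    # Toy stand-in for the dictionary of DataFrames
    df = {
        'Currency1': pd.DataFrame([[0.297453, 0.5]]),
        'Currency2': pd.DataFrame([[0.297453, 0.5]]),
    }

    # The dict keys become the outer level of the row index
    X = pd.concat(df)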
    

    You can now perform the PCA on the values from the DataFrame, e.g.

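    from sklearn.decomposition import PCA
    # n_components must not exceed min(n_samples, n_features)
    pca = PCA(n_components=min(5, *X.shape))
    pca.fit(X.values)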
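
    The printed output in the question looks like pandas Series of different lengths (79 and 1747 rows) rather than DataFrames. For that case, a minimal sketch follows. It assumes each Series should become one feature column and that rows sharing an index label belong to the same observation, which the question does not confirm. The name `collection` and the random data are hypothetical stand-ins for the real dict.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the question's dict: Series of unequal length.
rng = np.random.default_rng(0)
collection = {
    "USSP2 CMPN Curncy": pd.Series(rng.random(20)),
    "MARGDEBT Index": pd.Series(rng.random(12)),
    "USPHNSA Index": pd.Series(rng.random(12)),
}

# axis=1 aligns the Series on their index, giving one column per dict key.
wide = pd.concat(collection, axis=1)

# Shorter Series leave NaN in the unmatched rows, and PCA rejects NaN,
# so those rows are dropped here (imputing them would be an alternative).
aligned = wide.dropna()

# n_components cannot exceed min(n_samples, n_features).
n_components = min(5, *aligned.shape)
pca = PCA(n_components=n_components)
scores = pca.fit_transform(aligned.to_numpy())

print(aligned.shape)                  # rows kept after dropping NaN, columns
print(pca.explained_variance_ratio_)  # one ratio per retained component
```

    Dropping incomplete rows discards every observation beyond the shortest Series. If that loses too much data, resampling the Series to a common frequency or imputing the gaps before the fit may be a better choice.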