Tags: python, pandas, dataframe, pca

How to efficiently pass an array to a data frame?


I am running a PCA on the columns of a data frame DF1, and it returns an array of principal components. I would like to create a data frame DF2 that has the same index as DF1 and contains the values from the array of principal components.

From

 DF1=
                   v1       v2       v3
     2014-01-02   0.58     0.89    -0.19
     2014-01-03  -1.96     0.59     1.24
     2014-01-04   2.06    -0.15     3.54
     2014-01-05   0.31     1.25    -2.42
     2014-01-06   1.31     0.33     0.89
     ...          ...       ...      ...

PCs=
array([[ 0.14411173, -0.25557942,  0.08295314, ..., -0.24914411,
        -0.35242784,  0.17412245],
       [ 0.15391876, -0.3063616 , -0.62369197, ...,  0.18915513,
        -0.39056901,  0.23227158],
       [-0.00493105, -0.31936978,  0.35831582, ..., -0.2781707 ,
        -0.29810411,  0.27513239],
       [-0.5870741 ,  0.16183593,  0.10528634, ..., -0.21776753,
        -0.30365561,  0.17920256],
       [-0.6353732 , -0.28649561, -0.21702067, ...,  0.36312823,
        -0.11915208, -0.36003616]])

(In the PCs array, each row is a PC.)

Get

DF2=
                          PC1         PC2         PC3
         2014-01-02   0.14411173   0.15391876    ...
         2014-01-03  -0.25557942  -0.3063616
         2014-01-04   ...
         2014-01-05   
         2014-01-06   
         ...          ...       ...      ...

  1. How can I efficiently put the array of PCs into the data frame?
  2. Is there a better, more efficient way to get what I want than running the PCA on the data frame and then putting the array into a new data frame (for example, a way of getting the PCs directly in a data frame)?

Solution

  • You can create a new pandas DataFrame while explicitly passing the index of your df1 and transposing the PCA array.

    First create some dummy data:

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame(np.random.random(size=(3, 5)), index=pd.date_range(start="2014-01-02", periods=3))
    print(df1)
    
                       0         1         2         3         4
    2014-01-02  0.875032  0.853087  0.686504  0.682114  0.199243
    2014-01-03  0.522381  0.606048  0.398451  0.799883  0.030091
    2014-01-04  0.489119  0.997239  0.021816  0.307509  0.099752
    
    # create dummy pca results
    pca = np.random.random(size=(2, 3))
    print(pca)
    
    [[ 0.42791681  0.56512179  0.44731657]
     [ 0.10763007  0.35437208  0.79968957]]
    

    Now, build the column names, and create the pandas DataFrame while passing the index and columns along with the transposed pca array:

    columns = ["PC{}".format(x + 1) for x in range(pca.shape[0])]
    df2 = pd.DataFrame(pca.T, index=df1.index, columns=columns)
    print(df2)
    
                     PC1       PC2
    2014-01-02  0.427917  0.107630
    2014-01-03  0.565122  0.354372
    2014-01-04  0.447317  0.799690
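
The same steps can be wrapped in a small helper that checks the shapes before building the frame, since a mismatch between the array and the index is the most likely failure here. This is only a minimal sketch and not part of the original answer: the name `components_to_frame` and its `prefix` parameter are hypothetical, and it assumes each row of the array is one component with one value per index entry.

```python
import numpy as np
import pandas as pd


def components_to_frame(components, index, prefix="PC"):
    """Return a DataFrame with one column per row of `components`.

    Hypothetical helper: assumes `components` has shape
    (n_components, len(index)), i.e. each row is one component.
    """
    components = np.asarray(components)
    if components.ndim != 2 or components.shape[1] != len(index):
        raise ValueError(
            "expected shape (n_components, {}), got {}".format(
                len(index), components.shape
            )
        )
    columns = ["{}{}".format(prefix, i + 1) for i in range(components.shape[0])]
    # Transpose so that rows line up with the index and columns with the PCs.
    return pd.DataFrame(components.T, index=index, columns=columns)


# Same dummy data as in the example: 3 dates, 2 components.
df1 = pd.DataFrame(
    np.random.random(size=(3, 5)),
    index=pd.date_range(start="2014-01-02", periods=3),
)
pca = np.random.random(size=(2, 3))
df2 = components_to_frame(pca, df1.index)
print(df2)
```

Raising a `ValueError` early gives a clearer message than the shape error pandas would otherwise report when the transposed array does not match the index.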
    

    To answer your second question: I don't think there is a more efficient way to get the principal components directly into a DataFrame.
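
If the PCA itself is computed with NumPy, the result can be wrapped in a DataFrame in the same function, which at least keeps the two steps in one place. The following is only a sketch under assumptions: the question does not say which PCA implementation is used, so this one uses a plain SVD of the column-centered data via `np.linalg.svd`, and the function name `pca_scores_frame` is hypothetical. Here the rows of the input are the observations, so the scores already have one row per index entry and no transpose is needed.

```python
import numpy as np
import pandas as pd


def pca_scores_frame(df, n_components):
    """Hypothetical helper: PCA scores of `df` as a DataFrame on df.index."""
    X = df.to_numpy(dtype=float)
    X_centered = X - X.mean(axis=0)  # center each column
    # Thin SVD: X_centered = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Projection of the centered data onto the first n_components directions.
    scores = U[:, :n_components] * S[:n_components]
    columns = ["PC{}".format(i + 1) for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=columns)


# Dummy data shaped like DF1 in the question (values are random, not the original ones).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    rng.normal(size=(6, 3)),
    columns=["v1", "v2", "v3"],
    index=pd.date_range(start="2014-01-02", periods=6),
)
df2 = pca_scores_frame(df1, n_components=2)
print(df2)
```

This does not make the DataFrame construction itself any faster; it only removes the separate wrapping step from the calling code.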