I am running a PCA on the columns of a data frame DF1, and it returns an array of principal components. I would like to create a data frame DF2 that has the same index as DF1 and contains the values from the array of principal components.
From
DF1=
v1 v2 v3
2014-01-02 0.58 0.89 -0.19
2014-01-03 -1.96 0.59 1.24
2014-01-04 2.06 -0.15 3.54
2014-01-05 0.31 1.25 -2.42
2014-01-06 1.31 0.33 0.89
... ... ... ...
PCs=
array([[ 0.14411173, -0.25557942, 0.08295314, ..., -0.24914411,
-0.35242784, 0.17412245],
[ 0.15391876, -0.3063616 , -0.62369197, ..., 0.18915513,
-0.39056901, 0.23227158],
[-0.00493105, -0.31936978, 0.35831582, ..., -0.2781707 ,
-0.29810411, 0.27513239],
[-0.5870741 , 0.16183593, 0.10528634, ..., -0.21776753,
-0.30365561, 0.17920256],
[-0.6353732 , -0.28649561, -0.21702067, ..., 0.36312823,
-0.11915208, -0.36003616]])
(in the PCs array, each row is a PC)
Get
DF2=
PC1 PC2 PC3
2014-01-02 0.14411173 0.15391876 ...
2014-01-03 -0.25557942 -0.3063616
2014-01-04 ...
2014-01-05
2014-01-06
... ... ... ...
You can create a new pandas DataFrame by explicitly passing the index of df1 and transposing the pca array.
First create some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.random(size=(3, 5)), index=pd.date_range(start="2014-01-02", periods=3))
print(df1)
0 1 2 3 4
2014-01-02 0.875032 0.853087 0.686504 0.682114 0.199243
2014-01-03 0.522381 0.606048 0.398451 0.799883 0.030091
2014-01-04 0.489119 0.997239 0.021816 0.307509 0.099752
# create dummy pca results
pca = np.random.random(size=(2, 3))
print(pca)
[[ 0.42791681 0.56512179 0.44731657]
[ 0.10763007 0.35437208 0.79968957]]
Now, build the column names, and create the pandas DataFrame while passing the index and columns along with the transposed pca array:
columns = ["PC{}".format(x + 1) for x in range(pca.shape[0])]
df2 = pd.DataFrame(pca.T, index=df1.index, columns=columns)
print(df2)
PC1 PC2
2014-01-02 0.427917 0.107630
2014-01-03 0.565122 0.354372
2014-01-04 0.447317 0.799690
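If you do this in more than one place, the two steps can be wrapped in a small helper. The following is only a sketch: the name pcs_to_frame and the shape check are my own additions, not part of pandas, and it assumes the array is laid out as in the question (one PC per row, one value per row of df1).

```python
import numpy as np
import pandas as pd


def pcs_to_frame(pcs, index):
    """Hypothetical helper: wrap a (n_components, n_rows) array as a DataFrame."""
    pcs = np.asarray(pcs)
    # Each row of pcs is one PC, so its length must match the target index.
    if pcs.ndim != 2 or pcs.shape[1] != len(index):
        raise ValueError(
            "expected shape (n_components, {}), got {}".format(len(index), pcs.shape)
        )
    columns = ["PC{}".format(i + 1) for i in range(pcs.shape[0])]
    # Transpose so that each PC becomes a column aligned with the index.
    return pd.DataFrame(pcs.T, index=index, columns=columns)


index = pd.date_range(start="2014-01-02", periods=3)
pca = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])
df2 = pcs_to_frame(pca, index)
print(df2)
```

The shape check is there because a mismatch between the array and the index otherwise only surfaces as a less specific error from the DataFrame constructor.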
To answer your second question: I don't think that there is a more efficient way to create the DataFrame directly.
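For what it's worth, the transpose itself should not be a concern: in NumPy, .T returns a view on the same memory rather than a copy. A quick way to check this, as a sketch with dummy data (whether pandas then copies the data when building the DataFrame depends on the pandas version, so I make no claim about that here):

```python
import numpy as np

pca = np.random.random(size=(2, 3))
transposed = pca.T

# The transposed array reuses the buffer of the original array.
print(np.shares_memory(pca, transposed))  # True
print(transposed.shape)                   # (3, 2)
```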