import pickle

import pandas as pd
from sklearn.externals import joblib  # joblib lived here in scikit-learn 0.20.x

# Load the previously fitted PCA (with standardisation) and the feature column list
pca = joblib.load('pcawithstandard.pkl')
with open('collist.pickle', 'rb') as handle:
    collist = pickle.load(handle)

# Transform the dataset chunk by chunk to limit memory usage
for chunk in pd.read_csv('fortest.csv', chunksize=5):
    _transformed = chunk[collist]
    _transformed = pca.transform(_transformed)
    _transformed = pd.DataFrame(data=_transformed)
    # Join the row label back onto the transformed values
    _tempdata = chunk[['X__1']].join(_transformed)
    print(_tempdata)
I have a few big datasets with around 30k columns and anywhere from 10k to 40k rows. I'm trying to transform them with a PCA I fitted previously, and then join the result back to its row label 'X__1' based on each DataFrame's index. Since the datasets are big, I decided to use chunksize so I can limit the amount of memory used at each step.
The join worked for the first chunk, but for every subsequent chunk the right portion of my DataFrame comes out as all NaN. I've checked the DataFrame containing the transformed data, and it does contain values. When I join the untransformed data instead of the transformed data, it works, so I have no idea what's going on. I suspect that the PCA transform changed the structure of my DataFrame, which keeps it from joining properly.
The untransformed data has a mixture of int64 and float64 dtype columns, and is stored as an object. The transformed data's columns are all float64, and it is stored as an object too. The untouched chunk data has object, float64 and int64 columns, and is also stored as an object.
I'm on Python 3.6.4 and my module versions are:
numpy (1.16.1)
pandas (0.24.1)
scikit-learn (0.20.2)
I appreciate any help and opinions I can get.
Thanks in advance!
Since you are performing the join on the index, it succeeds for the first chunk: both sides happen to carry the same 0-based index. But for all subsequent chunks the indices no longer match. read_csv keeps counting, so the second chunk is indexed 5-9, the third 10-14, and so on, while pca.transform returns a plain NumPy array, and wrapping that array in pd.DataFrame gives it a fresh index starting from 0 every time. With no overlapping index values, the join fills the transformed columns with NaN.
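You can reproduce the mechanics with a toy example (made-up frames, independent of the question's data): a join on non-overlapping indices fills the right-hand side with NaN.

import pandas as pd

# A later chunk keeps its position in the file as its index
labels = pd.DataFrame({'X__1': ['r5', 'r6']}, index=[5, 6])

# A frame built from a bare array defaults to a fresh 0-based index
values = pd.DataFrame({'pc1': [0.1, 0.2]})

# Indices 5/6 vs 0/1 never match, so the joined column is all NaN
print(labels.join(values))
#   X__1  pc1
# 5   r5  NaN
# 6   r6  NaN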
The simplest fix is a reset_index on the label column just before the join, so that both sides share the same fresh 0-based index:
_tempdata = chunk[['X__1']].reset_index(drop=True).join(_transformed)
The drop=True discards the old index instead of adding it back as a column. Note that resetting the index of chunk[collist] before the transform would not help on its own, since pca.transform discards the index anyway.
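Putting it together, a minimal sketch of the corrected loop, reusing the file and column names from the question:

import pickle

import pandas as pd
from sklearn.externals import joblib  # joblib's location in scikit-learn 0.20.x

pca = joblib.load('pcawithstandard.pkl')
with open('collist.pickle', 'rb') as handle:
    collist = pickle.load(handle)

for chunk in pd.read_csv('fortest.csv', chunksize=5):
    # transform returns a bare array, so this frame gets a 0-based index
    transformed = pd.DataFrame(pca.transform(chunk[collist]))
    # Reset the label column's index so both sides line up row for row
    tempdata = chunk[['X__1']].reset_index(drop=True).join(transformed)
    print(tempdata)

Equivalently, you could keep the original row numbers by building the transformed frame with pd.DataFrame(pca.transform(chunk[collist]), index=chunk.index) and skipping the reset entirely.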