python pandas scikit-learn pca

Python: Merging/joining DataFrame after PCA transform results in NaN


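import pickle
import numpy as np
import pandas as pd
from sklearn.externals import joblib  # removed in scikit-learn 0.23; use `import joblib` there
from sklearn.decomposition import PCA

# Load the previously fitted PCA (named `pca` so it does not shadow the PCA class)
pca = joblib.load('pcawithstandard.pkl')

# Load the list of columns the PCA was fitted on
with open('collist.pickle', 'rb') as handle:
    collist = pickle.load(handle)

# Read the CSV five rows at a time to limit memory use
for chunk in pd.read_csv('fortest.csv', chunksize=5):

    _transformed = chunk[collist]
    _transformed = pca.transform(_transformed)
    _transformed = pd.DataFrame(data=_transformed)

    # Join the row label back onto the transformed data by index
    _tempdata = chunk[['X__1']].join(_transformed)
    print(_tempdata)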


I have a few big datasets, each with about 30k columns and anywhere from 10k to 40k rows. I'm trying to transform them with a PCA I fitted previously, and then join the result back to its row label 'X__1' based on each dataframe's index.

Since the datasets are big, I decided to use chunksize so that I can limit the amount of memory used at any one time.

The join worked for the first chunk, but for subsequent chunks the right-hand portion of my dataframe came out as NaN.
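The symptom can be reproduced without the real files. The sketch below is only an illustration under assumptions: the CSV content, the column names a and b, and the plain NumPy conversion standing in for the PCA transform are all made up, and only the 'X__1' label and the chunked read mirror the code above.

```python
import io

import pandas as pd

# Hypothetical stand-in data: a label column plus two numeric features.
csv_text = "X__1,a,b\n" + "\n".join(
    f"row{i},{i},{i * 2.0}" for i in range(10)
)

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    # Stand-in for the PCA transform: anything that returns a NumPy array.
    values = chunk[["a", "b"]].values
    transformed = pd.DataFrame(data=values)  # gets a fresh index starting at 0

    joined = chunk[["X__1"]].join(transformed)
    # Compare the two indices that the join tries to align.
    print(list(chunk.index), list(transformed.index))
    print(joined)
```

Printing the two indices side by side for each chunk makes it easy to see whether the frames being joined share any index labels.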

I've checked the dataframe containing my transformed data, and it does contain values.

Instead of joining the transformed data, I tried joining the untransformed data, and that seems to work, so I have no idea what's going on.

I suspect that the PCA transform changed the structure of my dataframe, which stops it from joining properly.

The untransformed data has a mixture of int64 and float64 columns and is stored as object.

The transformed data columns are all float64 and are stored as object too.

The untouched chunk data has object, float64 and int64 columns and is stored as object too.

I'm on Python 3.6.4 and my module versions are:

numpy (1.16.1)
pandas (0.24.1)
scikit-learn (0.20.2)

I'd appreciate any help and opinions I can get.

Thanks in advance!


Solution

  • Since you are performing the join on the index, it succeeds for the first chunk only. pd.read_csv with chunksize keeps a running index across chunks (0-4, then 5-9, and so on), but PCA.transform returns a plain NumPy array, so pd.DataFrame(data=_transformed) always gets a fresh index starting at 0. For all subsequent chunks there is therefore a mismatch between the indices of the original chunk and the decomposed one, and the join fills the right-hand columns with NaN.

    You can do a reset_index on each chunk before decomposing it, and you should then be able to join the result to the original column:

    chunk = chunk.reset_index(drop=True)
    

    Added drop=True for updated answer. :)
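An alternative to reset_index, sketched here rather than taken from the answer above, is to build the transformed frame with the chunk's own index so the original row numbers survive the join. Everything in the example is an assumption for illustration: the CSV content, the columns a and b, and fake_transform, which only stands in for a fitted PCA by returning a NumPy array.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in data: a label column plus two numeric features.
csv_text = "X__1,a,b\n" + "\n".join(
    f"row{i},{i},{i * 2.0}" for i in range(10)
)

rng = np.random.RandomState(0)
projection = rng.rand(2, 2)  # made-up weights, not a real fitted PCA


def fake_transform(frame):
    """Stand-in for the PCA transform: returns a plain NumPy array."""
    return frame.values @ projection


parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    values = fake_transform(chunk[["a", "b"]])
    # Reuse the chunk's index so the join aligns row for row.
    transformed = pd.DataFrame(data=values, index=chunk.index)
    parts.append(chunk[["X__1"]].join(transformed))

# Stitch the per-chunk results back together in their original order.
result = pd.concat(parts)
print(result)
```

Compared with reset_index, this keeps each row's position in the source file, which may help if the chunks are concatenated afterwards as done here.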