python, pandas, dataframe, sparse-matrix, countvectorizer

Efficient read and write of pandas dataframe


I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each and save to disk, so that I can read the data back in and process it piece by piece. I have tried dill and HDF storage, since CSV and raw text appear to take a long time.

I am trying this out on a subset of the data with ~500k rows and five columns of mixed types: two contain strings, one integers, one floats, and the last one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.

It is the last column that I am having problems with. Dumping and loading the data works without issue, but when I try to actually access it I get a pandas.Series object instead of a sparse matrix, and each element of that Series is a tuple that contains the whole dataset rather than a single row of counts.

# Before dumping, the original df has 100k rows.
# Each column has one value per row except for 'counts', which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.

import dill
import numpy as np
import pandas
import sklearn.feature_extraction.text

vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
dill.dump(df, open(file[i], 'wb'))  # binary mode for dill/pickle

df = dill.load(open(file[i], 'rb'))
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400)    # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>

Am I making an obvious mistake, or is there a better way to store data in this format that isn't very time consuming? It has to scale to my full dataset of 100 million rows.


Solution

  • df['counts'] = counts
    

    This will produce a Pandas Series (column) with the number of elements equal to len(df), where each element is the sparse matrix returned by vectorizer.fit_transform(df['string_data']).

    You can try the following instead:

    df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
    

    NOTE: be aware that this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you may end up with a MemoryError.

    CONCLUSION: That's why I'd recommend storing your original DF and the counts sparse matrix separately.
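
    A minimal sketch of what storing them separately could look like, assuming the plain columns fit in an HDF file (requires PyTables) and using scipy.sparse.save_npz / load_npz for the counts; the file names and the HDF key are placeholders I picked, not something from the question:

    import pandas as pd
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    # toy stand-in data; in the question these columns come from the real dataset
    df = pd.DataFrame({'string_data': ['foo bar', 'baz qux', 'spam eggs'],
                       'float': [0.1, 0.2, 0.3],
                       'integer': [1, 2, 3]})

    vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
    counts = vectorizer.fit_transform(df['string_data'])   # CSR matrix, len(df) x n_features

    # store the regular columns and the sparse counts separately
    df.to_hdf('data.h5', key='df', mode='w')                # or df.to_pickle('data.pkl')
    sp.save_npz('counts.npz', counts)

    # later: load both back; row i of counts still lines up with row i of df
    df = pd.read_hdf('data.h5', key='df')
    counts = sp.load_npz('counts.npz')
    print(df.shape, counts.shape)

    The same idea carries over to the 100k-row pieces: slice df.iloc[start:stop] and counts[start:stop] with the same boundaries and save each pair under its own file names.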