Tags: python, pandas, dictionary, scipy, sparse-matrix

Pandas: saving Series of dictionaries to disk


I have a Python pandas Series of dictionaries:

id           dicts
1            {'5': 1, '8': 20, '1800': 2}
2            {'2': 2, '8': 1, '1000': 25, '1651': 1}
...          ...
...          ...
...          ...
20000000     {'2': 1, '10': 20}

The (key, value) pairs in the dictionaries represent ('feature', count). About 2000 unique features exist.

The Series' memory usage in pandas is about 500 MB. What would be the best way to write this object to disk, ideally using little disk space and being fast to write and fast to read back in afterwards?

Options considered (and tried for the first two):
- to_csv (but it treats the dictionaries as strings, so converting them back to dictionaries afterwards is very slow; see the sketch after this list)
- cPickle (but it ran out of memory during execution)
- conversion to a scipy sparse matrix structure
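
For illustration, the to_csv round trip looks roughly like this; the slow part is parsing each cell's string back into a dict (ast.literal_eval here, and the file name is just an example):

import ast
import pandas as pd

s = pd.Series([{'5': 1, '8': 20, '1800': 2}, {'2': 2, '8': 1}])
s.to_csv('dicts.csv', header=False)

# Each cell comes back as the dict's repr string and has to be re-parsed,
# which is what makes this approach so slow at 20 million rows.
loaded = pd.read_csv('dicts.csv', header=None, index_col=0)[1]
parsed = loaded.apply(ast.literal_eval)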


Solution

  • I'm curious as to how your Series only takes up 500 MB. If you are using the .memory_usage method, it will only return the total memory used by each Python object reference, which is all your Series is storing. That doesn't account for the actual memory of the dictionaries. A rough calculation, 20,000,000 * 288 bytes = 5.76 GB, is closer to what your memory usage should be; that 288 bytes is a conservative estimate of the memory required by each dictionary.
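
    As a rough check of that figure (note that sys.getsizeof on a dict counts only the dict's own hash table, not the key and value objects it references, so the true footprint is even larger):

    import sys

    d = {'5': 1, '8': 20, '1800': 2}
    print(sys.getsizeof(d))                    # the container alone, typically a couple hundred bytes
    print(sys.getsizeof(d) * 20000000 / 1e9)   # naive scale-up to 20 million rows, in GB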

    Converting to a sparse matrix

    Anyway, try the following approach to convert your data into a sparse-matrix representation:

    import numpy as np, pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from scipy.sparse import csr_matrix
    import pickle
    

    I would use ints rather than strings as keys, as this will keep the right order later on. So, assuming your series is named dict_series:

    dict_series = dict_series.apply(lambda d: {int(k): v for k, v in d.items()})
    

    This might be memory intensive, and you may be better off simply creating your Series of dicts with ints as keys from the start; or you can simply skip this step. Now, to construct your sparse matrix:

    dv = DictVectorizer(dtype=np.int32)
    sparse = dv.fit_transform(dict_series)
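
    A quick sanity check on the result (dv.feature_names_ is the vocabulary the DictVectorizer learned, so its length should match the number of columns):

    print(sparse.shape)              # (number of rows, number of unique features)
    print(sparse.nnz)                # number of stored (feature, count) pairs
    print(len(dv.feature_names_))    # about 2000, same as sparse.shape[1]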
    

    Saving to disk

    Now, essentially, your sparse matrix can be reconstructed from 3 fields: sparse.data, sparse.indices, sparse.indptr, and optionally, sparse.shape. The fastest and most memory-efficient way to save and load the arrays sparse.data, sparse.indices and sparse.indptr is to use the np.ndarray tofile method, which saves the arrays as raw bytes. From the documentation:

    This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness.

    So this method loses any dtype information and endianness. The former issue can be dealt with simply by making a note of the datatype beforehand; you'll be using np.int32 anyway. The latter issue isn't a problem if you are working locally, but if portability is important, you will need to look into alternate ways of storing the information.

    # to save
    sparse.data.tofile('data.dat')
    sparse.indices.tofile('indices.dat')
    sparse.indptr.tofile('indptr.dat')
    # don't forget your dict vectorizer!
    with open('dv.pickle', 'wb') as f:
        pickle.dump(dv,f) # pickle your dv to be able to recover your original data!
    

    To recover everything:

    with open('dv.pickle', 'rb') as f:
        dv = pickle.load(f)
    
    sparse = csr_matrix((np.fromfile('data.dat', dtype=np.int32),
                         np.fromfile('indices.dat', dtype=np.int32),
                         np.fromfile('indptr.dat', dtype=np.int32)))
    
    original = pd.Series(dv.inverse_transform(sparse))
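
    If portability across machines matters, one alternative I would consider (not used above) is scipy.sparse.save_npz / load_npz, which store data, indices, indptr and the shape in a single compressed .npz file and preserve dtype and byte order (the file name here is just an example):

    import scipy.sparse

    scipy.sparse.save_npz('matrix.npz', sparse)
    sparse = scipy.sparse.load_npz('matrix.npz')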