I have a Python pandas Series of dictionaries:
id dicts
1 {'5': 1, '8': 20, '1800': 2}
2 {'2': 2, '8': 1, '1000': 25, '1651': 1}
... ...
... ...
... ...
20000000 {'2': 1, '10': 20}
The (key, value) pairs in the dictionaries represent ('feature', count). About 2000 unique features exist.
The Series' memory usage in pandas is about 500MB. What would be the best way to write this object to disk (ideally with low disk space usage, and being fast to write and fast to read back in afterwards)?
Options considered (and tried for the first two):
- to_csv (but treats the dictionaries as strings, so conversion back to dictionaries afterwards is very slow)
- cPickle (but ran out of memory during execution)
- conversion to a scipy sparse matrix structure
I'm curious as to how your Series only takes up 500MB. If you are using the .memory_usage method, this will only return the total memory used by each Python object reference, which is all your Series is actually storing; it doesn't account for the memory of the dictionaries themselves. A rough calculation of 20,000,000 * 288 bytes = 5.76GB should be your memory usage, where 288 bytes is a conservative estimate of the memory required by each dictionary.
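You can see the difference on a small sample: pandas only inspects the nested dictionaries when you pass deep=True. A quick sketch on toy data (not your actual Series):

```python
import pandas as pd

# Toy stand-in for the real 20M-row Series (hypothetical data)
dict_series = pd.Series([{'5': 1, '8': 20, '1800': 2},
                         {'2': 2, '8': 1, '1000': 25, '1651': 1}])

shallow = dict_series.memory_usage()         # counts only the object references
deep = dict_series.memory_usage(deep=True)   # also sizes each dict via sys.getsizeof
print(shallow < deep)  # True: the shallow figure badly understates real usage
```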
Anyway, try the following approach to convert your data into a sparse-matrix representation:
import numpy as np, pandas as pd
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import csr_matrix
import pickle
I would use ints rather than strings as keys, as this will keep the right order later on. So, assuming your series is named dict_series:
dict_series = dict_series.apply(lambda d: {int(k): v for k, v in d.items()})
This might be memory intensive, and you may be better off simply creating your Series of dicts with ints as keys from the start (or you can just skip this step). Now, to construct your sparse matrix:
dv = DictVectorizer(dtype=np.int32)
sparse = dv.fit_transform(dict_series)
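On a toy series shaped like the one in the question (hypothetical data, int keys), you get one row per dictionary and one column per unique feature:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy version of the data above, already with int keys (hypothetical example)
dict_series = pd.Series([{5: 1, 8: 20, 1800: 2},
                         {2: 2, 8: 1, 1000: 25, 1651: 1}])

dv = DictVectorizer(dtype=np.int32)
sparse = dv.fit_transform(dict_series)

print(sparse.shape)  # (2, 6): 2 dicts, 6 unique features
print(sparse.nnz)    # 7 stored counts; everything else is an implicit zero
```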
Now, essentially, your sparse matrix can be reconstructed from three fields: sparse.data, sparse.indices, sparse.indptr, and, optionally, sparse.shape. The fastest and most memory-efficient way to save and load the arrays sparse.data, sparse.indices, and sparse.indptr is the np.ndarray.tofile method, which writes the arrays as raw bytes. From the documentation:
This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness.
So this method loses any dtype information and endianness. The former issue can be dealt with simply by making note of the data type beforehand (you'll be using np.int32 anyway). The latter isn't a problem if you are working locally, but if portability is important, you will need to look into alternate ways of storing the information.
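If portability does matter, one alternative (my suggestion, not part of the tofile approach) is scipy.sparse.save_npz, which stores the same three arrays in .npy format, so dtype and byte order are recorded in the file headers:

```python
import os
import tempfile
import numpy as np
import scipy.sparse as sp

# Tiny CSR matrix built from (data, indices, indptr), as in the answer
m = sp.csr_matrix((np.array([1, 2], dtype=np.int32),      # data
                   np.array([0, 3], dtype=np.int32),      # indices
                   np.array([0, 1, 2], dtype=np.int32)))  # indptr

path = os.path.join(tempfile.mkdtemp(), 'sparse.npz')
sp.save_npz(path, m)     # .npy headers record dtype and endianness
m2 = sp.load_npz(path)

print(m2.dtype)          # int32: the dtype survives the round trip
```

The trade-off is a bit of container overhead versus raw tofile dumps, but the files are self-describing and machine-portable.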
# to save
sparse.data.tofile('data.dat')
sparse.indices.tofile('indices.dat')
sparse.indptr.tofile('indptr.dat')
# don't forget your dict vectorizer!
with open('dv.pickle', 'wb') as f:
    pickle.dump(dv, f)  # pickle your dv to be able to recover your original data!
# to read back
with open('dv.pickle', 'rb') as f:
    dv = pickle.load(f)

sparse = csr_matrix((np.fromfile('data.dat', dtype=np.int32),
                     np.fromfile('indices.dat', dtype=np.int32),
                     np.fromfile('indptr.dat', dtype=np.int32)))
original = pd.Series(dv.inverse_transform(sparse))
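As a sanity check, the whole round trip can be verified on toy data (a sketch using in-memory arrays in place of the .dat files):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer

# Toy data; the real Series would be 20M rows (hypothetical example)
before = pd.Series([{5: 1, 8: 20}, {2: 2, 8: 1}])

dv = DictVectorizer(dtype=np.int32)
m = dv.fit_transform(before)

# Rebuild from the three raw arrays, exactly as csr_matrix does after np.fromfile
rebuilt = csr_matrix((m.data, m.indices, m.indptr))
after = pd.Series(dv.inverse_transform(rebuilt))

print(after.tolist() == before.tolist())  # True: the dicts survive the round trip
```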