To my understanding, HDFStore.select is the tool to use for selecting from large data sets. However, when I try to loop over chunks using chunksize and iterator=True, the iterator itself becomes a very large object once the underlying dataset is large enough. I don't understand why the iterator object gets so large, or what information it has to hold that makes it grow like that.
I have a very large HDFStore table (7 bn rows, 420 GB on disk) that I would like to iterate over in chunks:
iterator = HDFStore.select('df', iterator=True, chunksize=chunksize)
for i, chunk in enumerate(iterator):
    # some code to apply to each chunk
When I run this code on a relatively small file, everything works fine. However, when I apply it to the 7 bn row database, I get a MemoryError as soon as the iterator is computed. I have 32 GB of RAM.
I would like to have a generator that creates the chunks on the fly and doesn't hold so much in RAM, for example:
iteratorGenerator = lambda: HDFStore.select('df', iterator=True, chunksize=chunksize)
for i, chunk in enumerate(iteratorGenerator):
    # some code to apply to each chunk
but iteratorGenerator is not iterable, so this doesn't work either.
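(Strictly speaking, the lambda would have to be called to get something iterable, e.g.

for i, chunk in enumerate(iteratorGenerator()):
    # some code to apply to each chunk

but presumably that just rebuilds the same large iterator object on the call, so it wouldn't help with the memory.)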
I could potentially loop HDFStore.select over start and stop rows, but I thought there should be a more elegant way to iterate; a sketch of what I have in mind follows.
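The start/stop version would look roughly like this (a sketch only; it assumes store is the open HDFStore and that store.get_storer('df').nrows gives the total row count):

nrows = store.get_storer('df').nrows  # total number of rows in the table
for start in range(0, nrows, chunksize):
    # read only the rows in [start, start + chunksize)
    chunk = store.select('df', start=start, stop=start + chunksize)
    # some code to apply to each chunk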
I had that same problem with an (only) 30 GB file, and apparently you can solve it by forcing the garbage collector to do its job... collect! :P

PS: you also don't need a lambda for that; the select call returns an iterator, so just loop over it, like you did in the first code block.
import gc

import pandas as pd

with pd.HDFStore(file_path, mode='a') as store:
    # All you need is the chunksize, not iterator=True
    iterator = store.select('df', chunksize=chunksize)
    for i, chunk in enumerate(iterator):
        # some code to apply to each chunk

        # magic line that solved my memory problem
        gc.collect()
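To make the pattern concrete, here is a minimal per-chunk aggregation built on the same loop (the column name 'value' and the partial_sums list are purely illustrative, not from the original code):

import gc

import pandas as pd

partial_sums = []
with pd.HDFStore(file_path, mode='r') as store:
    for chunk in store.select('df', chunksize=chunksize):
        # keep only the small per-chunk result, not the chunk itself
        partial_sums.append(chunk['value'].sum())
        del chunk
        gc.collect()  # free the chunk before the next one is read
total = sum(partial_sums)

This way only one chunk (plus the small running results) is in memory at any time.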