python, pandas, memory, iterator, hdfstore

Iterator and chunksize in HDFStore.select: "Memory error"


To my understanding, HDFStore.select is the tool to use for selecting from large data sets. However, when looping over chunks using chunksize and iterator=True, the iterator itself becomes a very large object once the underlying dataset is large enough, and I don't understand why the iterator object is so large or what kind of information it has to hold that makes it grow like this.

I have a very large HDFStore (7 billion rows, 420 GB on disk), which I would like to iterate over in chunks:

iterator = HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iterator):
    # some code to apply to each chunk

When I run this code on a relatively small file, everything works fine. However, when I try to apply it to the 7 billion row database, I get a MemoryError as soon as the iterator is computed. I have 32 GB of RAM.

I would like a generator that creates the chunks on the fly and doesn't store so much in RAM, for example:

iteratorGenerator = lambda: HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iteratorGenerator):
    # some code to apply to each chunk

but iteratorGenerator is a function, not an iterable, so this doesn't work either.
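
Calling the lambda, as in the sketch below, would make the result iterable, but it builds the very same large iterator object, so as far as I can tell the MemoryError would remain:

for i, chunk in enumerate(iteratorGenerator()):
    # some code to apply to each chunk
    pass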

I could potentially loop over HDFStore.select with explicit start and stop rows, but I thought there should be a more elegant way to iterate. For reference, the start/stop variant would look roughly like the sketch below; here store is the open HDFStore and chunksize is as above. (My guess is that, at least in some pandas versions, the chunked iterator reads the row coordinates of the whole selection up front, which for 7 billion rows would be tens of GB by itself; explicit slicing would sidestep that.)
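
nrows = store.get_storer('df').nrows

for start in range(0, nrows, chunksize):
    chunk = store.select('df', start=start, stop=start + chunksize)
    # some code to apply to each chunk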


Solution

  • I had the same problem with an (only) 30 GB file, and apparently you can solve it by forcing the garbage collector to do its job... collect! :P PS: Also, you don't need the lambda for this; the select call returns an iterator, so just loop over it, like you did in the first code block.

    import gc
    import pandas as pd

    # file_path and chunksize are assumed to be defined already
    with pd.HDFStore(file_path, mode='a') as store:
        # All you need is the chunksize;
        # iterator=True is not required
        iterator = store.select('df', chunksize=chunksize)

        for i, chunk in enumerate(iterator):
            # some code to apply to each chunk

            # magic line, that solved my memory problem
            gc.collect()
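
    If you still want the generator flavour from the question, you can wrap the same pattern in a generator function. A minimal sketch, assuming the table key 'df'; the helper name iter_chunks and the file name data.h5 are made up for illustration:

        import gc
        import pandas as pd

        def iter_chunks(file_path, key='df', chunksize=1_000_000):
            # Open the store read-only and yield one chunk at a time
            with pd.HDFStore(file_path, mode='r') as store:
                for chunk in store.select(key, chunksize=chunksize):
                    yield chunk
                    # as above, force a collection between chunks
                    gc.collect()

        for i, chunk in enumerate(iter_chunks('data.h5')):
            # some code to apply to each chunk
            pass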