Tags: python, iterator, numpy-memmap

Getting the index of the next element in a very large memmap which satisfies a condition


I have a memmap to a very large (10-100 GB) file containing current and voltage data. From a given starting index, I want to find the index of the next point for which the voltage satisfies a given condition.

In the case of a relatively small list I could do this with an iterator like so:

import numpy as np

filename = '[redacted]'
columntypes = np.dtype([('current', '>f8'), ('voltage', '>f8')])
data = np.memmap(filename, dtype=columntypes)
current = data['current']
voltage = data['voltage']

condition = (i for i, v in enumerate(voltage) if v > 0.1)
print(next(condition))

but because my memmap is so large, building the iterator this way is impractical. Is there a Pythonic way to do this without actually loading the data into memory? I can always take the ugly approach of reading chunks of data and looping through them until I find the index I need, but this seems inelegant.


Solution

  • If the file has line-based formatting (like a newline-delimited .csv), you can read and process it line by line:

    with open("foo.bar") as f:
        for line in f:
            do_something(line)
    

    Processing the file in fixed-size chunks doesn't have to be ugly either, using something like:

    with open("foo.bar") as f:
        for chunk in iter(lambda: f.read(128), ""):
            do_something(chunk)
    

    In your case, since you know the size of each record (one current/voltage pair), you can read each chunk in as raw bytes and then test your condition on the raw data (a sketch of such a check follows the loop below):

    sizeDataPoint = 16  # one record: two big-endian float64s ('>f8'), 8 bytes each

    index = 0
    matchIndex = None

    with open("foo.bar", "rb") as f:
        for chunk in iter(lambda: f.read(sizeDataPoint), b""):
            if check_conditions(chunk):  # placeholder predicate, sketched below
                matchIndex = index
                break  # stop at the first record satisfying the condition
            index += 1
    
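    For illustration, here is a minimal sketch of what check_conditions could look like, assuming the two-field '>f8' record layout and the 0.1 threshold from the question:

    import struct

    def check_conditions(chunk):
        # One 16-byte record holds two big-endian float64s: current, then
        # voltage (matching the question's dtype). True when voltage > 0.1.
        current, voltage = struct.unpack(">dd", chunk)
        return voltage > 0.1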

    If it needs to be memory mapped, I'm not 100% sure about numpy's memmap, but I remember using Python's built-in mmap module (a long time ago) to handle very large files. If I remember correctly, it does this through an OS mechanism called paging.

    The efficacy of this approach will depend on whether your OS supports it and how well it handles paging while you iterate through the file, but I think in theory mmap makes it possible to work with files larger than the memory available to Python.
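
    For reference, a minimal sketch of that approach using the standard-library mmap module. The 16-byte record size, the 0.1 threshold, and the find_next_index helper are assumptions for illustration, matching the question's dtype:

    import mmap
    import struct

    RECORD_SIZE = 16  # two big-endian float64s per record ('>f8' twice)

    def find_next_index(path, start_index, threshold=0.1):
        # Return the index of the first record at or after start_index
        # whose voltage exceeds threshold, or None if no record matches.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                n_records = len(mm) // RECORD_SIZE
                for i in range(start_index, n_records):
                    _current, voltage = struct.unpack_from(">dd", mm, i * RECORD_SIZE)
                    if voltage > threshold:
                        return i
        return None

    Because pages are only faulted in as they are touched, this scans the file without reading it wholesale into memory.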

    EDIT: Also, mmap won't work with large files unless you're on a 64-bit OS, since the file is mapped directly into the process's address space.