
How to merge two large numpy arrays if slicing doesn't resolve memory error?


I have two numpy arrays container1 and container2 where container1.shape = (900,4000) and container2.shape = (5000,4000). Merging them using vstack results in a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing like this:

mergedContainer = numpy.vstack((container1, container2[:1000]))
mergedContainer = numpy.vstack((mergedContainer, container2[1000:2500]))
mergedContainer = numpy.vstack((mergedContainer, container2[2500:3000]))

but after this even if I do:

mergedContainer = numpy.vstack((mergedContainer, container2[3000:3100]))

it results in MemoryError.

I am using Python 3.4.3 (32-bit) and would like to resolve this without switching to 64-bit.


Solution

  • Every time you call np.vstack, NumPy has to allocate space for a brand new array. So if we say one row requires 1 unit of memory, then

    np.vstack([container1, container2])
    

    requires an additional 900+5000 units of memory. Moreover, before the assignment occurs, Python needs to hold space for the old mergedContainer (if it exists) as well as space for the new mergedContainer. So building mergedContainer iteratively with slices actually requires more memory than trying to build it with a single call to np.vstack.
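
As a minimal sketch of that allocation behaviour (the small shapes and the names a, b and merged are illustrative stand-ins, not from the question), you can check that the result of np.vstack is a separate buffer whose size is the sum of its inputs:

```python
import numpy as np

# Small stand-ins for container1 and container2 (shapes are illustrative).
a = np.zeros((9, 4))
b = np.ones((50, 4))

merged = np.vstack((a, b))

# The result lives in a new buffer; it does not share memory with the inputs.
print(np.shares_memory(merged, a))  # False
print(np.shares_memory(merged, b))  # False

# Its size equals the combined size of the inputs, allocated on top of them.
print(merged.nbytes == a.nbytes + b.nbytes)  # True
```

While the call runs, both inputs and the output exist at the same time, which is where the extra units come from.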

    Building it iteratively (temp is the old mergedContainer, which must stay alive until the new one has been built and assigned):

    | total | mergedContainer | container1 | container2 | temp | statement                                                             |
    |-------+-----------------+------------+------------+------+-----------------------------------------------------------------------|
    |  7800 |            1900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2[:1000]))          |
    | 11200 |            3400 |        900 |       5000 | 1900 | mergedContainer = np.vstack((mergedContainer, container2[1000:2500])) |
    | 13200 |            3900 |        900 |       5000 | 3400 | mergedContainer = np.vstack((mergedContainer, container2[2500:3000])) |
    

    Building it from a single call to np.vstack:

    | total | mergedContainer | container1 | container2 | temp | statement                                             |
    |-------+-----------------+------------+------------+------+-------------------------------------------------------|
    | 11800 |            5900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2)) |
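
The totals in these tables can be reproduced with a little arithmetic. The sketch below is only bookkeeping in row units; the function name iterative_totals and its parameters are hypothetical, and it assumes the old mergedContainer stays alive for the duration of each np.vstack call:

```python
def iterative_totals(rows1, rows2, cuts):
    """Return the total rows held in memory during each vstack step.

    rows1, rows2: row counts of container1 and container2.
    cuts: end indices of the container2 slices appended at each step.
    """
    totals = []
    merged = 0   # rows currently held by mergedContainer
    start = 0
    for stop in cuts:
        old = merged                                # old mergedContainer (temp)
        base = rows1 if old == 0 else old           # first step stacks container1
        new = base + (stop - start)                 # freshly allocated result
        totals.append(new + rows1 + rows2 + old)
        merged = new
        start = stop
    return totals


print(iterative_totals(900, 5000, [1000, 2500, 3000]))  # [7800, 11200, 13200]
print(5900 + 900 + 5000)                                 # 11800 for a single vstack
```

After only 3000 of the 5000 rows, the iterative version already needs more than the single call does in total.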
    

    We can do even better, however. Instead of calling np.vstack repeatedly, allocate all the space that is needed once from the very beginning and write the contents of both container1 and container2 into it. In other words, avoid allocating two disparate arrays container1 and container2 if you know eventually you want to merge them.

    container = np.empty((5900, 4000))
    

    Note that basic slices such as container[:900] always return views, and views require essentially no additional memory. So you could define container1 and container2 like this:

    container1 = container[:900]   
    container2 = container[900:]   
    

    and then assign values in place. This modifies container:

    container1[:] = ...              
    container2[:] = ...
    

    Thus your memory requirement would stay around 5900 units.
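
To convince yourself that the slices really are views, here is a small check on a scaled-down array (the (59, 40) shape is an illustrative stand-in for (5900, 4000)):

```python
import numpy as np

container = np.empty((59, 40))   # scaled-down stand-in for (5900, 4000)
container1 = container[:9]
container2 = container[9:]

# Basic slices are views: they point at container's buffer.
print(container1.base is container)              # True
print(np.shares_memory(container, container2))   # True

# In-place assignment through a view changes container itself.
container1[:] = 1.0
container2[:] = 2.0
print(container[0, 0], container[-1, 0])         # 1.0 2.0
```

Rebinding the name instead (container1 = some_array) would create or reference a separate array and leave container untouched.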


    For example,

    import numpy as np
    np.random.seed(2015)
    
    container = np.empty((5, 4), dtype='int')
    container1 = container[:2]   
    container2 = container[2:]   
    container1[:] = np.random.randint(10, size=(2,4))
    container2[:] = np.random.randint(1000, size=(3,4))
    print(container)
    

    yields

    [[  2   2   9   6]
     [  8   5   7   8]
     [112  70 487 124]
     [859   8 275 936]
     [317 134 393 909]]
    

    while only requiring space for one array of shape (5, 4), plus temporarily used space for the random arrays.

    Thus, you wouldn't have to change very much in your code to save memory. Just set it up with

    container = np.empty((5900, 4000))
    container1 = container[:900]   
    container2 = container[900:]   
    

    and then use

    container1[:] = ...
    

    instead of

    container1 = ...
    

    to assign values in-place. (Or, of course, you could just write directly into container.)
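
Writing directly into container could look roughly like the sketch below. The generator iter_chunks is a hypothetical stand-in for however your data is actually produced or loaded, and the tiny shapes are illustrative; the only real requirement is that the total number of rows is known before allocating:

```python
import numpy as np


def iter_chunks():
    """Hypothetical data source: yields 2-D blocks with 4 columns each."""
    rng = np.random.default_rng(0)
    for n_rows in (2, 3, 1):
        yield rng.integers(0, 10, size=(n_rows, 4))


# The total row count must be known up front so the buffer is allocated once.
container = np.empty((6, 4), dtype=int)

row = 0
for chunk in iter_chunks():
    stop = row + chunk.shape[0]
    container[row:stop] = chunk   # copy into the preallocated buffer
    row = stop

assert row == container.shape[0]  # every row was filled
print(container.shape)            # (6, 4)
```

With this pattern only one chunk at a time exists alongside container, so the peak stays close to the size of the final array.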