Tags: python, numpy, memory-leaks, mutable

Memory leak when assigning numpy.argpartition() to list element multiple times


I am having trouble understanding a memory leak in my code. I guess that my mistake has to do with NumPy arrays being mutable, since the problem can be solved using .copy().

I don't understand why this happens, though. Here is a minimal example of the code with the memory leak; it uses around 1600 MB of memory:

import numpy as np
import sys

k_neighbours = 5
np.random.seed(42)
data = np.random.rand(10000)

for _ in range(3):
    closest_neighbours = [
        # get indices of k closest neighbours
        np.argpartition(
            np.abs(data-point),
            k_neighbours
        )[:k_neighbours]
        for point in data
    ]

print('\nsize:',sys.getsizeof(closest_neighbours))
print('first 3 entries:',closest_neighbours[:3])

And here is the same code, but with an added .copy(). This seems to solve the problem; the program uses about 80 MB of memory, as I would expect.

for _ in range(3):
    closest_neighbours = [
        # get indices of k closest neighbours
        np.argpartition(
            np.abs(data-point),
            k_neighbours
        )[:k_neighbours].copy()
        for point in data
    ]

print('\nsize:',sys.getsizeof(closest_neighbours))
print('first 3 entries:',closest_neighbours[:3])

The final result is the same for both:

size: 87624
first 3 entries: [
    array([   0, 3612, 2390,  348, 3976]),
    array([   1, 6326, 2638, 9978,  412]),
    array([5823, 5866,    2, 1003, 9307])
]

as expected.

I would have thought that np.argpartition() creates a new object, so I don't understand why .copy() solves the memory problem. Even if that's not the case and np.argpartition() somehow changes the data object itself, why would that result in a memory leak?


Solution

  • Your problem can be boiled down to this example:

    import numpy as np
    
    array = np.empty(10000)       # large array: 10000 float64 values, about 80 kB
    view = array[:5]              # slicing creates a view that shares array's buffer
    copy = array[:5].copy()       # an independent 5-element array with its own buffer
    

    Here the view object also keeps far more memory alive than the copy object: the view holds a reference to the entire 10000-element array, while the copy owns only its own five elements.
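
    A quick way to see this, as a sketch (exact byte counts depend on the platform and NumPy version):

    import sys
    import numpy as np

    array = np.empty(10000)
    view = array[:5]
    copy = array[:5].copy()

    print(view.base is array)   # True: the view holds a reference to the full array
    print(copy.base is None)    # True: the copy owns its own buffer
    print(view.base.nbytes)     # 80000: bytes kept alive through the view
    print(copy.nbytes)          # 40: bytes owned by the copy
    print(sys.getsizeof(view))  # small (~96 bytes): getsizeof sees only the array header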

    Explanation

    As described in the NumPy manual, "NumPy slicing creates a view instead of a copy". Therefore the underlying memory of the original array "will not be released until all arrays derived from it are garbage-collected."
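
    This is exactly what happens in the question: np.argpartition() does return a new array, but the [:k_neighbours] slice taken from it is a view, so every element of closest_neighbours keeps a full 10000-element index array alive. A small sketch (the index dtype, and with it the byte count, depends on the platform):

    import numpy as np

    data = np.random.rand(10000)
    k_neighbours = 5

    full = np.argpartition(np.abs(data - data[0]), k_neighbours)  # new 10000-element index array
    part = full[:k_neighbours]                                    # view into that new array

    print(full.base is None)   # True: argpartition allocated a fresh array
    print(part.base is full)   # True: the slice is only a view of it
    print(part.base.nbytes)    # 80000 with 64-bit indices, kept alive by every such view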

    When slicing a large array, the NumPy docs also suggest using copy(): "Care must be taken when extracting a small portion from a large array ... in such cases an explicit copy() is recommended."
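
    The pattern the docs recommend looks roughly like this (a sketch):

    import numpy as np

    a = np.arange(1_000_000)   # large temporary array
    b = a[:100].copy()         # explicit copy keeps only the 100 needed elements
    del a                      # nothing references the large buffer any more, so it can be freed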

    Measuring the memory usage

    The reason sys.getsizeof returned the same value in both of your examples is that, as the Python docs put it, "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to." In your examples you called sys.getsizeof on a list object, so it returns the size of the list itself and does not account for the NumPy arrays inside it.

    For example, sys.getsizeof([None for _ in data]) will also return 87624.
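
    The same value comes back no matter what the list elements are (a sketch; the exact number depends on the CPython version):

    import sys
    import numpy as np

    data = np.random.rand(10000)

    print(sys.getsizeof([None for _ in data]))            # size of the list object itself
    print(sys.getsizeof([np.empty(100) for _ in data]))   # same value, despite ~8 MB of array data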

    Memory usage of NumPy arrays

    To get the size of the data array you can call sys.getsizeof with data as its argument:

    sys.getsizeof(data)
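
    For an array that owns its data, this is essentially the size of the underlying buffer plus a small object header; data.nbytes reports the buffer alone (a sketch, the exact header size varies):

    import sys
    import numpy as np

    data = np.random.rand(10000)

    print(data.nbytes)          # 80000: the raw float64 buffer
    print(sys.getsizeof(data))  # slightly larger: the buffer plus the ndarray header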
    

    Now, to get the size of all arrays in your closest_neighbours list, you might try something like this:

    sum(sys.getsizeof(x) for x in closest_neighbours)
    

    Be aware that this will not work if the list contains any views. As stated in the Python docs, sys.getsizeof "will return correct results [for built-in objects], but this does not have to hold true for third-party extensions as it is implementation specific." In the case of NumPy views, view.__sizeof__() will return 96 (just the array header), no matter how large the array that the view keeps alive is.
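
    A sketch that makes the difference visible, using a scaled-down version of the question's loop (exact numbers depend on the platform and on the index dtype NumPy picks):

    import sys
    import numpy as np

    np.random.seed(42)
    data = np.random.rand(2000)
    k_neighbours = 5

    views = [np.argpartition(np.abs(data - p), k_neighbours)[:k_neighbours] for p in data]
    copies = [v.copy() for v in views]

    print(sum(sys.getsizeof(v) for v in views))    # small: every view reports only its ~96-byte header
    print(sum(sys.getsizeof(c) for c in copies))   # somewhat larger: each copy owns its 5-element buffer
    print(sum(v.base.nbytes for v in views))       # ~32 MB here: a full 2000-element index array per view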