I am having trouble understanding a memory leak in my code. I guess that my mistake has to do with NumPy arrays being mutable, since it can be solved using .copy().

I don't understand why this happens, though. Here is a minimal example of the code with the memory leak, which uses around 1600 MB of memory:
import numpy as np
import sys

k_neighbours = 5
np.random.seed(42)
data = np.random.rand(10000)

for _ in range(3):
    closest_neighbours = [
        # get indices of the k closest neighbours
        np.argpartition(
            np.abs(data - point),
            k_neighbours
        )[:k_neighbours]
        for point in data
    ]

print('\nsize:', sys.getsizeof(closest_neighbours))
print('first 3 entries:', closest_neighbours[:3])
And here is the same code, but with an added .copy(). This seems to solve the problem: the program uses about 80 MB of memory, as I would expect.
for _ in range(3):
    closest_neighbours = [
        # get indices of the k closest neighbours
        np.argpartition(
            np.abs(data - point),
            k_neighbours
        )[:k_neighbours].copy()
        for point in data
    ]

print('\nsize:', sys.getsizeof(closest_neighbours))
print('first 3 entries:', closest_neighbours[:3])
The final result is the same for both:
size: 87624
first 3 entries: [
array([ 0, 3612, 2390, 348, 3976]),
array([ 1, 6326, 2638, 9978, 412]),
array([5823, 5866, 2, 1003, 9307])
]
as expected.
I would have thought that np.argpartition() creates a new object, and therefore I don't understand why copy() solves the memory problem. Even if that's not the case and np.argpartition() somehow changes the data object itself, why does that result in a memory leak?
Your problem can be boiled down to this example:
import numpy as np

array = np.empty(10000)    # ~80 kB buffer
view = array[:5]           # slicing creates a view into `array`
copy = array[:5].copy()    # an independent five-element array
Here the view object will effectively cost much more memory than the copy object, because keeping view alive also keeps the entire 10000-element buffer alive.
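You can see the relationship directly: a simple slice keeps a reference to the original array through its base attribute, so the whole buffer stays reachable for as long as the view does. A minimal sketch:

import numpy as np

array = np.empty(10000)
view = array[:5]
copy = array[:5].copy()

print(view.base is array)    # True:  the view references `array`
print(copy.base is None)     # True:  the copy owns its own buffer
print(view.flags.owndata)    # False: the data belongs to `array`
print(copy.flags.owndata)    # True

As long as view exists, the full 10000-element buffer of array cannot be garbage-collected.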
Explanation
As described in the NumPy manual, "NumPy slicing creates a view instead of a copy". Therefore the underlying memory of the original array "will not be released until all arrays derived from it are garbage-collected."
When extracting a small portion from a large array, the NumPy docs also suggest using copy():

"Care must be taken when extracting a small portion from a large array ... in such cases an explicit copy() is recommended."
Measuring the memory usage
The reason sys.getsizeof returned the same value in both of your examples is that "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to." In your examples you called sys.getsizeof on a list object, so it returns the size of the list itself and does not account for the size of the NumPy arrays within it.
For example, sys.getsizeof([None for _ in data]) will also return 87624.
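You can verify that the reported list size is independent of what the elements refer to; a sketch (the exact number varies with the Python version, but both lists come out identical):

import sys
import numpy as np

data = np.random.rand(10000)
arrays = [np.empty(1000) for _ in data]   # 10000 arrays of ~8 kB each
nones = [None for _ in data]              # 10000 references to None

print(sys.getsizeof(arrays))   # 87624 in your environment
print(sys.getsizeof(nones))    # identical: the list only stores references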
Memory usage of NumPy arrays
To get the size of the data array, you can call sys.getsizeof with data as its argument:

sys.getsizeof(data)  # ~80 kB: 10000 float64 values plus the array header
Now, to get the size of all arrays in your closest_neighbours list, you might try something like this:

sum(sys.getsizeof(x) for x in closest_neighbours)
Be aware that this will not work if the list contains any views. As stated in the Python docs, sys.getsizeof "will return correct results [for built-in objects], but this does not have to hold true for third-party extensions as it is implementation specific." In the case of a NumPy view, view.__sizeof__() will return 96, regardless of how large the underlying buffer is.
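A minimal sketch of the discrepancy (exact byte counts vary by platform and NumPy version):

import sys
import numpy as np

array = np.empty(10000)      # ~80 kB of float64 data
view = array[:5]
copy = array[:5].copy()

print(sys.getsizeof(copy))   # header plus 5 * 8 bytes of data
print(sys.getsizeof(view))   # roughly the header alone; the ~80 kB
                             # buffer is attributed to `array`
print(view.base.nbytes)      # 80000: what the view actually keeps alive

If you need the real footprint, sum the nbytes of each array's base (or of the array itself when base is None) rather than relying on sys.getsizeof.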