Tags: python, python-2.7, numpy, memory-management, memory-profiling

Assigning names to large objects appears to increase memory usage considerably


Usually, when I need to invoke a complicated formula, I break it down into two or more lines to make the code more comprehensible. However, when profiling some code that calculates RMSE, I discovered that doing this appears to increase my code's memory use. Here's a simplified example:

import numpy as np
import random
from memory_profiler import profile

@profile
def fun1():
    #very large datasets (~750 mb each)
    predicted = np.random.rand(100000000)
    observed = np.random.rand(100000000)
    #calculate residuals as intermediate step
    residuals = observed - predicted
    #calculate RMSE
    RMSE = np.mean(residuals **2) ** 0.5
    #delete residuals
    del residuals

@profile
def fun2():
    #same sized data
    predicted = np.random.rand(100000000)
    observed = np.random.rand(100000000)
    #same calculation, but with residuals and RMSE calculated on same line
    RMSE = np.mean((observed - predicted) ** 2) ** 0.5

if __name__ == "__main__":
    fun1()
    fun2()

Output:

Filename: memtest.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     19.9 MiB      0.0 MiB   @profile
     6                             def fun1():
     7    782.8 MiB    763.0 MiB        predicted = np.random.rand(100000000)
     8   1545.8 MiB    762.9 MiB        observed = np.random.rand(100000000)
     9   2308.8 MiB    763.0 MiB        residuals = observed - predicted
    10   2308.8 MiB      0.1 MiB        RMSE = np.mean(residuals ** 2) ** 0.5
    11   1545.9 MiB   -762.9 MiB        del residuals


Filename: memtest.py

Line #    Mem usage    Increment   Line Contents
================================================
    13     20.0 MiB      0.0 MiB   @profile
    14                             def fun2():
    15    783.0 MiB    762.9 MiB        predicted = np.random.rand(100000000)
    16   1545.9 MiB    762.9 MiB        observed = np.random.rand(100000000)
    17   1545.9 MiB      0.0 MiB        RMSE = np.mean((observed - predicted) ** 2) ** 0.5

As you can see, the first function (where the calculation is split) appears to require an additional ~750 MB at peak, presumably the cost of the residuals array. However, both functions have to create that array; the only difference is that the first one assigns it a name. This is contrary to my understanding of how memory management in Python is supposed to work.

So, what's going on here? One thought is that this could be some artifact of the memory_profiler module. Watching the Windows task manager during a run indicates a similar pattern (though I know that's not a terribly trustworthy validation). If this is a "real" effect, what am I misunderstanding about the way memory is handled? Or, is this somehow numpy-specific?


Solution

  • memory_profiler's "Mem usage" column reports the memory usage after each line completes, not the peak memory usage while that line is executing. In the version where you don't assign residuals a name, the temporary array is discarded before the line finishes, so it never shows up in the profiler output.
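
As a quick check (a minimal sketch, not part of the original answer), you can sample memory at a short interval while each function runs using memory_profiler.memory_usage; the sampled peak for fun2 should come out roughly the same as for fun1, because the temporary (observed - predicted) array exists while that line is evaluating:

import numpy as np
from memory_profiler import memory_usage

def fun1():
    predicted = np.random.rand(100000000)
    observed = np.random.rand(100000000)
    residuals = observed - predicted           # named intermediate array
    return np.mean(residuals ** 2) ** 0.5

def fun2():
    predicted = np.random.rand(100000000)
    observed = np.random.rand(100000000)
    # the temporary (observed - predicted) array exists while this line runs
    return np.mean((observed - predicted) ** 2) ** 0.5

if __name__ == "__main__":
    # sample memory every 0.1 s while each function runs and keep the maximum
    peak1 = max(memory_usage((fun1,), interval=0.1))
    peak2 = max(memory_usage((fun2,), interval=0.1))
    print("peak during fun1: %.1f MiB" % peak1)
    print("peak during fun2: %.1f MiB" % peak2)

With a small enough sampling interval, both peaks should land around 2.3 GiB (three ~763 MiB arrays alive at once), matching the fun1 profile above. The difference in the line-by-line report is about when the measurement is taken, not about how much memory numpy actually allocates.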