Tags: python, mxnet

Memory leak when running mxnet cpu inference


I ran inference on 100 pictures; the memory_profiler analysis is listed below. Why does line 308 cause so much memory growth?

mxnet==1.5.1

Line #    Mem usage    Increment   Line Contents
================================================
   297 8693.719 MiB   81.809 MiB           data = nd.array(im_tensor)
   298 8693.719 MiB    0.000 MiB           db = mx.io.DataBatch(data=(data,), provide_data=[('data', data.shape)])
   299 8630.039 MiB    2.840 MiB           self.model.forward(db, is_train=False)
   300 8630.039 MiB    2.320 MiB           net_out = self.model.get_outputs()
   301 8693.719 MiB    2.062 MiB           for _idx,s in enumerate(self._feat_stride_fpn):
   302 8693.719 MiB    2.062 MiB               _key = 'stride%s'%s
   303 8693.719 MiB    1.031 MiB               stride = int(s)
   304 8693.719 MiB    1.031 MiB               if self.use_landmarks:
   305 8693.719 MiB    1.031 MiB                 idx = _idx*3
   306                                         else:
   307                                           idx = _idx*2
   308 8693.719 MiB 4700.676 MiB               scores = net_out[idx].asnumpy()
   309 8693.719 MiB    1.289 MiB               print scores.shape
   310 8693.719 MiB    1.031 MiB               scores = scores[:, self._num_anchors['stride%s'%s]:, :, :]
   311 8693.719 MiB    1.031 MiB               idx+=1
   312 8693.719 MiB    2.836 MiB               bbox_deltas = net_out[idx].asnumpy()
   ...

Solution

  • MXNet Python API calls are actually just queued in the MXNet backend engine and processed asynchronously in C++, so what you see in this Python profiler may not reflect what is actually happening under the hood. For profiling, I recommend looking at the dedicated tool: https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/profiler.html
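    As a minimal sketch of that built-in profiler (the output filename and the dummy workload here are arbitrary choices, not from the original post):

    ```python
    import mxnet as mx

    # Enable the MXNet engine-level profiler; aggregate_stats=True lets us
    # dump a text summary, and filename is an arbitrary Chrome-trace output path.
    mx.profiler.set_config(profile_all=True, filename='profile.json',
                           aggregate_stats=True)
    mx.profiler.set_state('run')

    # Stand-in for the inference loop being profiled.
    a = mx.nd.ones((1000, 1000))
    b = mx.nd.dot(a, a)
    b.wait_to_read()  # force the queued work to actually execute

    mx.profiler.set_state('stop')
    summary = mx.profiler.dumps()  # per-operator time and memory statistics
    print(summary)
    ```

    Unlike memory_profiler, this attributes time and memory to the C++ operators where the work really happens, not to the Python line that merely enqueued it.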

    I suspect your .asnumpy() call is associated with high memory usage because it is a blocking call: it requires the results to be immediately available and therefore forces the MXNet engine to compute all the necessary dependencies right away. It is generally recommended to avoid using NumPy in MXNet code and to use MXNet NDArray instead, which is better suited to deep learning (asynchronous execution, GPU compatibility, support for automatic differentiation) than NumPy. For example, you could accumulate any information you need in MXNet NDArrays and then do whatever you need with them at the end of the execution (save to file, convert to NumPy, etc.). More resources:
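    A minimal sketch of that pattern, applied to a loop like the one in the question (the `net_out` list and tensor shapes here are made up for illustration):

    ```python
    import mxnet as mx

    # Hypothetical stand-in for net_out as returned by model.get_outputs():
    # a list of NDArrays, one per FPN stride.
    net_out = [mx.nd.random.uniform(shape=(1, 4, 16, 16)) for _ in range(3)]

    all_scores = []
    for idx in range(3):
        scores = net_out[idx]         # stays an NDArray: no blocking copy
        scores = scores[:, 2:, :, :]  # slicing is also an asynchronous NDArray op
        all_scores.append(scores)

    # A single synchronization point at the end, instead of one .asnumpy()
    # per loop iteration forcing the engine to flush every time.
    mx.nd.waitall()
    results = [s.asnumpy() for s in all_scores]
    ```

    Keeping the per-stride results as NDArrays lets the engine schedule all the work freely; only the final conversion blocks.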