Tags: memory-leaks, pytorch, feature-extraction

Feature extraction in loop seems to cause memory leak in pytorch


I have spent considerable time trying to debug some PyTorch code, and I have created a minimal example of it to help better understand what the issue might be.

I have removed all portions of the code that are unrelated to the issue, so the remaining piece won't make much sense from a functional standpoint, but it still exhibits the error I'm facing.

The overall task runs in a loop: every pass computes the embedding of the image and adds it into a variable that accumulates it. It's effectively aggregating (not concatenating), so the size remains the same. I don't expect the number of iterations to overflow the datatype; I don't see that happening here or in my full code.
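For clarity, here is a rough, self-contained sketch (toy sizes, not my actual code) of the difference between aggregating in place and concatenating:

    import torch

    embedding = torch.zeros(8, 4, 4)          # accumulator with a fixed shape
    for _ in range(10):
        delta = torch.randn(8, 1, 1)          # stand-in for one iteration's embedding delta
        embedding = embedding + delta         # aggregating: shape stays (8, 4, 4)
    # concatenating instead would grow the tensor on every pass:
    # embedding = torch.cat([embedding, torch.zeros(8, 4, 4)], dim=0)   # shape becomes (16, 4, 4)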

  • I have added multiple metrics to evaluate the size of the tensors I'm working with, to make sure their memory footprint is not growing
  • I'm checking the overall GPU memory usage to confirm what is leading to the final RuntimeError: CUDA out of memory. (a PyTorch-native way of reading these numbers is sketched right after this list)
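
For reference, here is a minimal sketch (independent of the pynvml calls used in the full code below) of how the same numbers can be read from PyTorch itself; it assumes a CUDA device is available:

    import torch

    # Bytes currently held by tensors vs. bytes reserved by PyTorch's caching allocator
    print("allocated:", torch.cuda.memory_allocated() / 1e9, "GB")
    print("reserved: ", torch.cuda.memory_reserved() / 1e9, "GB")

    # Memory footprint of a single tensor, same formula as in the loop below
    t = torch.zeros(2048, device="cuda")
    print("tensor footprint:", t.element_size() * t.nelement(), "bytes")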

My environment is as follows:

 - Python 3.6.2
 - PyTorch 1.4.0
 - CUDA toolkit 10.0
 - Driver version 410.78
 - GPU: Nvidia GeForce GT 1030 (2GB VRAM)

(I have replicated this experiment with the same result on a Titan RTX with 24GB of VRAM, with the same PyTorch version, CUDA toolkit, and driver; it simply runs out of memory further into the loop.)

The complete code is below. I have marked two lines as culprits: deleting them removes the issue, though obviously I need to find a way to execute them without running into memory problems. Any help would be much appreciated! You can use any image named "source_image.bmp" to reproduce the issue.

import torch
from PIL import Image
import torchvision
from torchvision import transforms
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
import sys
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'      # this is necessary on my system to allow the environment to recognize my nvidia GPU for some reason
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'    # to debug by having all CUDA functions executed in place
torch.set_default_tensor_type('torch.cuda.FloatTensor')

# Preprocess image
tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224), 
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),])
img = tfms(Image.open('source_image.bmp')).unsqueeze(0).cuda()

model = torchvision.models.resnet50(pretrained=True).cuda()
model.eval()    # we put the model in evaluation mode, to prevent storage of gradient which might accumulate

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'Total available memory   : {info.total / 1000000000}')

feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
orig_embedding = feature_extractor(img)

embedding_depth = 2048

mem0 = 0

embedding = torch.zeros(2048, img.shape[2], img.shape[3]) #, dtype=torch.float)

patch_size=[4,4]
patch_stride=[2,2]
patch_value=0.0

# Here, we iterate over the patch placement, defined at the top left location
for row in range(img.shape[2]-1):
    for col in range(img.shape[3]-1):
        print("######################################################")        
                
        ######################################################
        # Isolated line, culprit 1 of the GPU memory leak
        ######################################################
        patched_embedding = feature_extractor(img)
        
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        
        ######################################################
        # Isolated line, culprit 2 of the GPU memory leak
        ######################################################
        embedding[:,row:row+1,col:col+1] = torch.add(embedding[:,row:row+1,col:col+1], delta_embedding)

        print("img size:\t\t", img.element_size() * img.nelement())
        print("patched_embedding size:\t", patched_embedding.element_size() * patched_embedding.nelement())
        print("delta_embedding size:\t", delta_embedding.element_size() * delta_embedding.nelement())
        print("Embedding size:\t\t", embedding.element_size() * embedding.nelement())

        del patched_embedding, delta_embedding
        torch.cuda.empty_cache()
        
        info = nvmlDeviceGetMemoryInfo(h)
        print("\nMem usage increase:\t", info.used / 1000000000 - mem0)
        mem0 = info.used / 1000000000
        print(f'Free:\t\t\t {(info.total - info.used) / 1000000000}')

print("Done.")

Solution

  • Add this to your code as soon as you load the model

    for param in model.parameters():
        param.requires_grad = False
    

From https://pytorch.org/docs/stable/notes/autograd.html#excluding-subgraphs-from-backward
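
This works because, with requires_grad enabled on the model parameters, every call to feature_extractor(img) builds an autograd graph, and writing the result (which carries a grad_fn) into the persistent embedding tensor keeps those graphs alive across iterations, so GPU memory grows with every pass. An equivalent alternative is to disable graph construction around the forward pass with torch.no_grad(); the sketch below assumes the same loop body as in the question:

    # inside the row/col loop, replacing the two culprit lines
    with torch.no_grad():
        patched_embedding = feature_extractor(img)
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        embedding[:, row:row+1, col:col+1] += delta_embedding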