Tags: memory-leaks, pytorch, feature-extraction

Feature extraction in loop seems to cause memory leak in pytorch


I have spent considerable time trying to debug some PyTorch code, and I have created a minimal example of it to help better understand what the issue might be.

I have removed all portions of the code that are unrelated to the issue, so the remaining piece won't make much sense from a functional standpoint, but it still exhibits the error I'm facing.

The overall task runs in a loop: every pass computes the embedding of the image and adds it into a variable that accumulates it. It's effectively aggregating (not concatenating), so the size remains the same. I don't expect the number of iterations to overflow the datatype; I don't see that happening here or in my full code.
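For clarity, here is a rough, self-contained sketch (toy sizes, not my actual code) of the difference between aggregating in place and concatenating:

    import torch

    embedding = torch.zeros(8, 4, 4)          # accumulator with a fixed shape
    for _ in range(10):
        delta = torch.randn(8, 1, 1)          # stand-in for one iteration's embedding delta
        embedding = embedding + delta         # aggregating: shape stays (8, 4, 4)
    # concatenating instead would grow the tensor on every pass:
    # embedding = torch.cat([embedding, torch.zeros(8, 4, 4)], dim=0)   # shape becomes (16, 4, 4)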

  • I have added multiple metrics to evaluate the size of the tensors I'm working with, to make sure their memory footprint is not growing
  • I'm checking the overall GPU memory usage to confirm what is leading to the final RuntimeError: CUDA out of memory. (a PyTorch-native way of reading these numbers is sketched right after this list)
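
For reference, here is a minimal sketch (independent of the pynvml calls used in the full code below) of how the same numbers can be read from PyTorch itself; it assumes a CUDA device is available:

    import torch

    # Bytes currently held by tensors vs. bytes reserved by PyTorch's caching allocator
    print("allocated:", torch.cuda.memory_allocated() / 1e9, "GB")
    print("reserved: ", torch.cuda.memory_reserved() / 1e9, "GB")

    # Memory footprint of a single tensor, same formula as in the loop below
    t = torch.zeros(2048, device="cuda")
    print("tensor footprint:", t.element_size() * t.nelement(), "bytes")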

My environment is as follows:

 - Python 3.6.2
 - PyTorch 1.4.0
 - CUDA toolkit 10.0
 - Driver version 410.78
 - GPU: Nvidia GeForce GT 1030 (2GB VRAM)

(I have replicated this experiment with the same result on a Titan RTX with 24GB of VRAM, with the same PyTorch version, CUDA toolkit, and driver; it simply runs out of memory further into the loop.)

The complete code is below. I have marked two lines as culprits: deleting them removes the issue, though obviously I need to find a way to execute them without running into memory problems. Any help would be much appreciated! You can use any image named "source_image.bmp" to reproduce the issue.

import torch
from PIL import Image
import torchvision
from torchvision import transforms
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
import sys
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'      # this is necessary on my system to allow the environment to recognize my nvidia GPU for some reason
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'    # to debug by having all CUDA functions executed in place
torch.set_default_tensor_type('torch.cuda.FloatTensor')

# Preprocess image
tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224), 
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),])
img = tfms(Image.open('source_image.bmp')).unsqueeze(0).cuda()

model = torchvision.models.resnet50(pretrained=True).cuda()
model.eval()    # we put the model in evaluation mode, to prevent storage of gradient which might accumulate

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'Total available memory   : {info.total / 1000000000}')

feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
orig_embedding = feature_extractor(img)

embedding_depth = 2048

mem0 = 0

embedding = torch.zeros(2048, img.shape[2], img.shape[3]) #, dtype=torch.float)

patch_size=[4,4]
patch_stride=[2,2]
patch_value=0.0

# Here, we iterate over the patch placement, defined at the top left location
for row in range(img.shape[2]-1):
    for col in range(img.shape[3]-1):
        print("######################################################")        
                
        ######################################################
        # Isolated line, culprit 1 of the GPU memory leak
        ######################################################
        patched_embedding = feature_extractor(img)
        
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        
        ######################################################
        # Isolated line, culprit 2 of the GPU memory leak
        ######################################################
        embedding[:,row:row+1,col:col+1] = torch.add(embedding[:,row:row+1,col:col+1], delta_embedding)

        print("img size:\t\t", img.element_size() * img.nelement())
        print("patched_embedding size:\t", patched_embedding.element_size() * patched_embedding.nelement())
        print("delta_embedding size:\t", delta_embedding.element_size() * delta_embedding.nelement())
        print("Embedding size:\t\t", embedding.element_size() * embedding.nelement())

        del patched_embedding, delta_embedding
        torch.cuda.empty_cache()
        
        info = nvmlDeviceGetMemoryInfo(h)
        print("\nMem usage increase:\t", info.used / 1000000000 - mem0)
        mem0 = info.used / 1000000000
        print(f'Free:\t\t\t {(info.total - info.used) / 1000000000}')

print("Done.")

Solution

  • Add this to your code as soon as you load the model

    for param in model.parameters():
        param.requires_grad = False
    

From https://pytorch.org/docs/stable/notes/autograd.html#excluding-subgraphs-from-backward
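
This works because, with requires_grad enabled on the model parameters, every call to feature_extractor(img) builds an autograd graph, and writing the result (which carries a grad_fn) into the persistent embedding tensor keeps those graphs alive across iterations, so GPU memory grows with every pass. An equivalent alternative is to disable graph construction around the forward pass with torch.no_grad(); the sketch below assumes the same loop body as in the question:

    # inside the row/col loop, replacing the two culprit lines
    with torch.no_grad():
        patched_embedding = feature_extractor(img)
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        embedding[:, row:row+1, col:col+1] += delta_embedding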