I'm trying to train a PyTorch Flair model in AWS SageMaker. While doing so, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)
For training I used the sagemaker.pytorch.estimator.PyTorch class.
I tried different instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 I get a CPU out-of-memory error, on g4dn a GPU out-of-memory error, and on p3 also a GPU out-of-memory error, mostly because PyTorch is using only one 12 GB GPU instead of all 8 * 12 GB.
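For reference, here is a quick sketch of how I would confirm what PyTorch actually sees on the instance (run inside the training script, not the notebook; Flair trains on a single device by default):

import torch

# report the CUDA devices visible to this process
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("current device:", torch.cuda.get_device_name(torch.cuda.current_device()))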
I'm not getting anywhere with this training. I even tried locally on a CPU machine and got the following error:
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!
The model training script:
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from torch.optim import Adam

# load train/val splits from the exported CSV files
corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')
print("finished loading corpus")

# stack GloVe with forward/backward Flair embeddings and pool them with an LSTM
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)
trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")
P.S.: I was able to train the same architecture with a smaller dataset on my local GPU machine (a GTX 1650 with 4 GB of GDDR5 memory), and it was really quick.
Okay, so after 2 days of continuous debugging I was able to find the root cause. What I understood is that Flair does not impose any limit on sentence length, in the sense of word count; it takes the longest sentence as the maximum. That was causing the issue: in my case a few documents had around 1.5 lakh (150,000) words, which is far too much to load embeddings for into memory, even on a 16 GB GPU. That is where it was breaking.
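To spot such outliers, you can print the token counts of the longest training documents. A rough sketch, reusing the corpus variable from the script above (it assumes the corpus is loaded in memory and can be indexed):

# sort all training documents by token count and show the ten longest
token_counts = sorted((len(corpus.train[i]) for i in range(len(corpus.train))), reverse=True)
print("ten longest training documents (token counts):", token_counts[:10])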
To solve this: for documents with this many words, you can take a chunk of n words (10K in my case) from any portion of the content (left/right/middle, anywhere) and truncate the rest, or simply drop those records from training if they make up only a small fraction of the data. A minimal sketch of that preprocessing step follows.
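This sketch assumes the training data is a CSV with a 'text' column; the column name and the pandas-based approach are my assumptions, so adapt them to your export format:

import pandas as pd

MAX_WORDS = 10_000  # chunk size that worked for me; tune to your GPU memory

def keep_first_chunk(text, max_words=MAX_WORDS):
    # keep the first max_words words and truncate the rest;
    # you could just as well keep a middle or right-hand chunk instead
    words = str(text).split()
    return " ".join(words[:max_words])

df = pd.read_csv('../data/exports/train.csv')
# option 1: truncate the over-long documents to the first MAX_WORDS words
df['text'] = df['text'].apply(keep_first_chunk)   # 'text' column name is an assumption
# option 2 (instead of truncating): drop the over-long records entirely if they are rare
# df = df[df['text'].str.split().str.len() <= MAX_WORDS]
df.to_csv('../data/exports/train_truncated.csv', index=False)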
After this, I hope you will be able to make progress with your training, as happened in my case.
P.S.: If you are following this thread and face a similar issue, feel free to comment back so that I can look into your case and help.