python, deep-learning, pytorch, gpu, classification

Data shuffling changes GPU memory use drastically


I'm having a problem with a pytorch-ignite classification model. The code is quite long, so I'd like to first ask if anyone can explain this behavior in theory.

I am doing many classifications in a row. In each iteration, I select a subset of my data randomly and perform classification. My results were quite poor (accuracy ~ 0.6). I realized that in each iteration my training dataset was not balanced: I have a lot more class 0 data than class 1, so a random selection tends to contain more data from class 0.

So, I modified the selection procedure: I randomly select N data points from class 1, then N data points from class 0, and concatenate the two (so the label order looks like [1111111100000000]). Finally, I shuffle this list to mix the labels before feeding it to the network.
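
For reference, a minimal sketch of that selection procedure might look like the following. The function name `balanced_subset` and the NumPy arrays `features`/`labels` are placeholders for illustration, not the actual code from the project:

```python
import numpy as np

def balanced_subset(features, labels, n_per_class, rng=None):
    """Pick n_per_class samples from each class, concatenate, then shuffle."""
    rng = np.random.default_rng() if rng is None else rng

    idx_class1 = np.flatnonzero(labels == 1)
    idx_class0 = np.flatnonzero(labels == 0)

    # Random pick without replacement from each class.
    pick1 = rng.choice(idx_class1, size=n_per_class, replace=False)
    pick0 = rng.choice(idx_class0, size=n_per_class, replace=False)

    # Concatenated order is [1111...0000...]; the final permutation mixes it.
    picked = np.concatenate([pick1, pick0])
    picked = rng.permutation(picked)

    return features[picked], labels[picked]
```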

The problem is, with this new data selection, my GPU runs out of memory within seconds. This was odd, since with the first data selection policy the code ran for tens of hours.

I retraced my steps. It turns out that if I do not shuffle the data at the end, i.e. if I keep the [1111111100000000] order, all is well. If I do shuffle the data, I need to reduce my batch_size by a factor of 5 or more so the code doesn't crash from running out of GPU memory.

Any idea what is happening here? Is this to be expected?


Solution

  • I found the solution to my problem, but I don't really understand the details of why it works. When initially picking a batch_size, I chose 90: 64 was slow, I was worried 128 would be too large, and a quick search led me to believe that sticking to powers of 2 shouldn't matter much. Turns out, it does matter! At least when your classification training data is balanced. As soon as I changed my batch_size to a power of 2, there was no memory overflow. In fact, I ran the whole thing with a batch_size of 128 and there was no problem (see the sketch below) :)
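
    For context, the only change that mattered was the batch_size passed to the loader. A minimal sketch, with hypothetical tensor shapes and names standing in for the real balanced, shuffled training subset:

    ```python
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder tensors standing in for the balanced, shuffled training data.
    x_train = torch.randn(2048, 32)
    y_train = torch.randint(0, 2, (2048,))

    train_ds = TensorDataset(x_train, y_train)

    # batch_size=90 caused the GPU memory blow-up once the data was shuffled;
    # switching to a power of 2 (here 128) avoided the overflow in my runs.
    train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
    ```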