I'm trying to build a PyTorch project on an IterableDataset with zarr as the storage backend.
from itertools import islice

import zarr
from torch.utils.data import IterableDataset


class Data(IterableDataset):
    def __init__(self, path, start=None, end=None):
        super(Data, self).__init__()
        # Open the zarr array in read-only mode; data is loaded lazily.
        store = zarr.DirectoryStore(path)
        self.array = zarr.open(store, mode='r')
        if start is None:
            start = 0
        if end is None:
            end = self.array.shape[0]
        assert end > start
        self.start = start
        self.end = end

    def __iter__(self):
        # Yield rows start..end by iterating over the zarr array.
        return islice(self.array, self.start, self.end)
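For context, the dataset gets consumed with a plain DataLoader, roughly like this (path, batch size and the start/end split are placeholders, not my real values):

from torch.utils.data import DataLoader

# Placeholder path and parameters; the real values depend on the setup.
dataset = Data('/path/to/data.zarr', start=0, end=1_000_000)
loader = DataLoader(dataset, batch_size=512, num_workers=0)

for batch in loader:
    ...  # training / validation step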
This works quite nicely with small test datasets, but once I move to my actual dataset (480,000,000 x 290) I'm running into a memory leak. I've tried logging the Python heap periodically as everything slows to a crawl, but I couldn't see anything growing abnormally in size, so the library I used (pympler) didn't actually catch the memory leak.
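For reference, the periodic heap logging looked roughly like this (a minimal sketch; how often it gets called is arbitrary):

from pympler import muppy, summary

def log_heap():
    # Summarize all live Python objects and print the largest consumers.
    all_objects = muppy.get_objects()
    heap_summary = summary.summarize(all_objects)
    summary.print_(heap_summary)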
I'm kind of at my wits' end, so if anybody has any idea how to debug this further, it would be greatly appreciated.
Cross-posted on PyTorch Forums.
Turns out that I had an issue in my validation routine:
with torch.no_grad():
    for batch in tqdm(testloader, **params):
        x = batch[:, 1:].to(device)
        y = batch[:, 0].unsqueeze(0).T
        y_test_pred = torch.sigmoid(sxnet(x))
        y_pred_tag = torch.round(y_test_pred)
        # Both appended arrays are still 2-D column arrays at this point.
        y_pred_list.append(y_pred_tag.cpu().numpy())
        y_list.append(y.numpy())
I originally thought that I was well clear of running into trouble by appending my results to lists, but the issue is that the result of .numpy() was an array of arrays (since the original data was a 2-D column tensor of shape n x 1 rather than a flat 1-D tensor). Adding .flatten() to the numpy arrays fixed the issue, and RAM consumption is now in line with what I originally provisioned.
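Concretely, the two append lines now look like this (only the flattening added):

# Flatten the (n, 1) column arrays to 1-D before appending,
# so each list element is a flat array instead of an array of arrays.
y_pred_list.append(y_pred_tag.cpu().numpy().flatten())
y_list.append(y.numpy().flatten())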