I am working with multiple CSV files, each containing multiple columns of 1D data. I have about 9000 such files and the total combined data is about 40 GB.
I have written a dataloader like this:
class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.files = files
        my_data = np.genfromtxt('/data/'+files, delimiter=',')
        self.dim = my_data.shape[1]
        self.data = []

    def __getitem__(self, i):
        file1 = self.files
        my_data = np.genfromtxt('/data/'+file1, delimiter=',')
        self.dim = my_data.shape[1]
        for j in range(my_data.shape[1]):
            tmp = np.reshape(my_data[:,j], (1, my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            self.data.append(tmp)
        return self.data[i]

    def __len__(self):
        return self.dim
The way I am loading the whole dataset into the DataLoader is through a for loop:
for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:
But this is running terribly slowly. I was wondering if I could store all of that data in one file, but I don't have enough RAM. So is there a way around it? Let me know if there is.
I've never used PyTorch before, and I confess I don't really know what's going on. Nonetheless I'm almost certain you're using Dataset wrong.
As I understand it, the Dataset is an abstraction over all the data, where each index returns one sample. Say each of your 9000 files has 10 rows (samples); then index 21 would refer to the 3rd file and its 2nd row (using 0-indexing).
Because you have so much data you don't want to load everything into memory. So the Dataset should manage just getting one value, and the DataLoader creates batches of the values.
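To make that index arithmetic concrete, here is a minimal sketch of the mapping, assuming (purely for illustration) that every file has the same number of rows; the class further down does not rely on that assumption, and rows_per_file and locate are hypothetical names, not part of it:

# Illustration only: map a global sample index to (file, row),
# assuming every CSV has exactly rows_per_file samples.
rows_per_file = 10

def locate(idx):
    # divmod gives (which file, which row inside that file)
    file_no, row_no = divmod(idx, rows_per_file)
    return file_no, row_no

print(locate(21))   # (2, 1): the 3rd file, 2nd row, with 0-indexing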
There's almost certainly some optimisation that can be applied to what I've done, but maybe this can start you off. I created the directory csvs
with these files:
❯ cat csvs/1.csv
1,2,3
2,3,4
3,4,5
❯ cat csvs/2.csv
21,21,21
34,34,34
66,77,88
Then I created this Dataset class. It takes a directory as input (where all the CSVs are stored). The only thing it stores in memory is the name of every file and the number of lines it has. When an item is requested, we work out which file contains that index and return the Tensor for that line.
By only ever iterating through files, we never store file contents in memory. One improvement would be not to iterate over the whole list of files every time to find out which one is relevant, and to make use of generators and state when accessing consecutive indexes.
(When accessing index 8 in a 10-line file, we read and discard the first 8 lines, which we can't avoid for a single random access. But when we then access index 9, it would be better to recognise that we can simply return the next line, rather than reading through the earlier lines again.)
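As a hedged sketch of the first improvement only (the within-file seeking is a separate issue), you could precompute cumulative sample counts once and use bisect to find the owning file in logarithmic time. The names and counts here are illustrative and not part of the class below:

import bisect
from itertools import accumulate

# Illustration only: per-file sample counts, in file order.
counts = [3, 3, 4]
cumulative = list(accumulate(counts))   # [3, 6, 10]

def find_file(idx):
    # bisect_right finds the first file whose cumulative total exceeds idx
    file_no = bisect.bisect_right(cumulative, idx)
    start = cumulative[file_no - 1] if file_no else 0
    return file_no, idx - start          # (file index, row within that file)

print(find_file(7))   # (2, 1): sample 7 overall is row 1 of the 3rd file

With that aside, here is the full class: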
import numpy as np
import torch
from functools import lru_cache
from pathlib import Path
from pprint import pprint
from torch.utils.data import Dataset, DataLoader


@lru_cache()
def get_sample_count_by_file(path: Path) -> int:
    # Count the lines (samples) in one CSV; cached so each file is only read once.
    c = 0
    with path.open() as f:
        for line in f:
            c += 1
    return c


class CSVDataset(Dataset):
    def __init__(self, csv_directory: str, extension: str = ".csv"):
        self.directory = Path(csv_directory)
        # Store only (path, sample count) pairs, never the file contents.
        self.files = sorted(
            (f, get_sample_count_by_file(f))
            for f in self.directory.iterdir()
            if f.suffix == extension
        )
        self._sample_count = sum(f[-1] for f in self.files)

    def __len__(self):
        return self._sample_count

    def __getitem__(self, idx):
        current_count = 0
        for file_, sample_count in self.files:
            if current_count <= idx < current_count + sample_count:
                # stop when the index we want is in the range of the samples in this file
                break  # now file_ will be the file we want
            current_count += sample_count
        # now file_ has sample_count samples
        file_idx = idx - current_count  # the index we want to access in file_
        with file_.open() as f:
            for i, line in enumerate(f):
                if i == file_idx:
                    data = np.array([float(v) for v in line.split(",")])
                    return torch.from_numpy(data)
Now we can use the DataLoader as I believe is intended:
dataset = CSVDataset("csvs")
loader = DataLoader(dataset, batch_size=4)
pprint(list(enumerate(loader)))
"""
[(0,
tensor([[ 1., 2., 3.],
[ 2., 3., 4.],
[ 3., 4., 5.],
[21., 21., 21.]], dtype=torch.float64)),
(1, tensor([[34., 34., 34.],
[66., 77., 88.]], dtype=torch.float64))]
"""
You can see this correctly returns batches of data. Rather than printing this out you can process each batch and only store that batch in memory.
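As a minimal sketch of that, reusing the DataLoader settings from the question (the epoch count and the loop body are placeholders for your own training step):

dataset = CSVDataset("csvs")
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=8, pin_memory=True)

for epoch in range(3):            # placeholder epoch count
    for batch in loader:
        batch = batch.float()     # rows arrive as float64; cast if your model expects float32
        # ... forward pass, loss and optimiser step on `batch` go here ...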
See the docs for further information: https://pytorch.org/tutorials/recipes/recipes/custom_dataset_transforms_loader.html#part-3-the-dataloader