Tags: python, image-processing, python-imaging-library, pytorch, tarfile

Fastest way to read an image from huge uncompressed tar file in __getitem__ of PyTorch custom dataset


I have a huge dataset of 2 million jpg images in a single uncompressed TAR file. I also have a text file in which each line is the name of an image in the TAR file, in order:

img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...

and the images in the tar file follow exactly that order. I searched a lot and found that the tarfile module is the best option, but when I tried to read images from the tar file by name, it took too long. The reason is that every time I call the getmember(name) method, it calls getmembers(), which scans the whole tar file, builds a list of all members, and only then looks up the name in that list.

If it helps, the dataset is a single 20 GB tar file.

I don't know whether it is better to extract everything first and use the extracted folders in my CustomDataset, or to keep reading directly from the archive.

Here is the code I am using to read a single file from the tar file:

    import io
    import tarfile
    from PIL import Image

    with tarfile.open('data.tar') as tf:
        # getmember() triggers a full archive scan on first use
        tarinfo = tf.getmember('img_0000001.jpg')
        image = tf.extractfile(tarinfo)
        image = image.read()
        image = Image.open(io.BytesIO(image))

I used this code in the __getitem__ method of my CustomDataset class, which loops over all the names in filelist.txt.

Thanks for any advice


Solution

  • tarfile does seem to cache the results of getmembers(), so getmember() can reuse them on later calls.

    But if you use the provided snippet in __getitem__, then for each item in the dataset the tar file is opened and read in full, one image file is extracted, and the tar file is then closed, so the cached member info is lost.

    The simplest way to resolve this is probably to open the tar file in your dataset's __init__, e.g. self.tf = tarfile.open('data.tar'), but then you need to remember to close it when you are done (see the sketch below).
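
    A minimal sketch of that idea, assuming single-process data loading (TarImageDataset and its constructor arguments are illustrative names, not from the original post): the archive is scanned once in __init__, and __getitem__ only does a dictionary lookup plus one read.

        import io
        import tarfile

        from PIL import Image
        from torch.utils.data import Dataset

        class TarImageDataset(Dataset):
            def __init__(self, tar_path, list_path):
                # Keep a single handle open; getmembers() scans the archive
                # once and caches all member info on this handle.
                self.tf = tarfile.open(tar_path)
                self.members = {m.name: m for m in self.tf.getmembers()}
                with open(list_path) as f:
                    self.names = [line.strip() for line in f if line.strip()]

            def __len__(self):
                return len(self.names)

            def __getitem__(self, idx):
                # Dictionary lookup instead of a fresh archive scan per item
                data = self.tf.extractfile(self.members[self.names[idx]]).read()
                return Image.open(io.BytesIO(data)).convert('RGB')

            def close(self):
                self.tf.close()

    Calling close() when you are finished covers the cleanup point above. One caveat: a single shared tar handle is not safe to read from several DataLoader worker processes at once, so with num_workers > 0 you would typically open the handle lazily inside each worker instead.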