I have a huge dataset of 2 million jpg images in a single uncompressed TAR file. I also have a txt file in which each line is the name of an image in the TAR file, in order:
img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...
and the images in the tar file are stored in exactly that order.
I searched a lot and found that the tarfile module is the best option, but when I tried to read images from the tar file by name, it took too long. The reason is that every time I call the getmember(name) method, it calls the getmembers() method, which scans the whole tar file and builds a list of all members, and then searches that list for the name.
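For illustration, this is the kind of one-time scan I mean; building a name-to-member index with getmembers() pays that cost only once (a sketch, using the same data.tar and names as above):

    import tarfile

    with tarfile.open('data.tar') as tf:
        # One full pass over the archive; afterwards every lookup is a dict hit.
        index = {m.name: m for m in tf.getmembers()}
        info = index['img_0000001.jpg']
        data = tf.extractfile(info).read()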
If it helps, my dataset is a single 20 GB tar file.
I don't know whether it is a better idea to first extract everything and then use the extracted folders in my CustomDataset, or to read directly from the archive.
Here is the code I am using to read a single file from the tar file:
    import io
    import tarfile
    from PIL import Image

    with tarfile.open('data.tar') as tf:
        tarinfo = tf.getmember('img_0000001.jpg')  # first call scans the whole archive
        image = tf.extractfile(tarinfo).read()
        image = Image.open(io.BytesIO(image))
I use this code in the __getitem__ method of my CustomDataset class, which loops over all the names in filelist.txt.
Thanks for any advice.
tarfile does seem to cache lookups for getmember(): it reuses the member list built by getmembers().
But if you use the provided snippet in __getitem__, then for each item in the dataset the tar file is opened and scanned in full, one image file is extracted, and then the tar file is closed and the cached member info is lost.
The simplest way to resolve this is probably to open the tar file once in your dataset's __init__, e.g. self.tf = tarfile.open('data.tar'), but then you need to remember to close it when you are done.
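A minimal sketch of that idea, assuming a PyTorch-style Dataset (the class name TarDataset and the constructor paths are placeholders):

    import io
    import tarfile
    from PIL import Image
    from torch.utils.data import Dataset

    class TarDataset(Dataset):
        def __init__(self, tar_path, list_path):
            # Open the archive once; the first lookup triggers a single full
            # scan whose result is cached for all later lookups.
            self.tf = tarfile.open(tar_path)
            with open(list_path) as f:
                self.names = [line.strip() for line in f if line.strip()]

        def __len__(self):
            return len(self.names)

        def __getitem__(self, idx):
            # extractfile() accepts a name and reuses the cached member list.
            data = self.tf.extractfile(self.names[idx]).read()
            return Image.open(io.BytesIO(data))

        def close(self):
            self.tf.close()

One caveat: an open file handle is not safe to share across processes, so if you use a DataLoader with multiple workers, each worker should open its own handle (for example, lazily on first access in __getitem__) rather than inheriting the one from __init__.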