I am about to publish a machine learning dataset. This dataset contains about 170,000 files (png images of 32px x 32px). I first wanted to share them by a zip archive (57.2MB). However, extracting those files takes extremely long (more than 15 minutes - I'm not sure when I started).
Is there a better format to share those files?
I just did some Benchmarks:
I used dtrx
to extract the following and time dtrx filename
to get the time.
Format File size Time to extract
.7z 27.7 MB > 1h
.tar.bz2 29.1 MB 7.18s
.tar.lzma 29.3 MB 6.43s
.xz 29.3 MB 6.56s
.tar.gz 33.3 MB 6.56s
.zip 57.2 MB > 30min
.jar 70.8 MB 5.64s
.tar 177.9 MB 5.40s
Interesting. The extracted content is 47 MB big. Why is .tar
more than 3 times the size of its content?
Anyway. I think tar.bz2
might be a good choice.