Tags: zip, compression, archive, dataformat

Which data format can I use to distribute a large number of small files?


I am about to publish a machine learning dataset. It contains about 170,000 files (PNG images of 32px x 32px). I first wanted to share them as a zip archive (57.2 MB). However, extracting those files takes extremely long (more than 15 minutes; I'm not sure exactly when I started).

Is there a better format to share those files?


Solution

  • I just ran some benchmarks:

    Experiments / Benchmarks

    I extracted each of the following archives with dtrx and ran "time dtrx filename" to measure the extraction time.

    Format      File size     Time to extract
    .7z          27.7 MB      > 1h
    .tar.bz2     29.1 MB      7.18s
    .tar.lzma    29.3 MB      6.43s
    .xz          29.3 MB      6.56s
    .tar.gz      33.3 MB      6.56s
    .zip         57.2 MB      > 30min
    .jar         70.8 MB      5.64s
    .tar        177.9 MB      5.40s
    

    Interesting. The extracted content is 47 MB in size. Why is the .tar more than 3 times the size of its content?
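A likely explanation, assuming a standard tar layout: tar stores every file as one 512-byte header block followed by its data padded up to a multiple of 512 bytes, so a PNG of a few hundred bytes occupies at least 1024 bytes in the archive. The sketch below is only an estimate. It takes the 47 MB and 170,000 figures from this post, assumes every file has the average size, and uses the hypothetical helpers tar_footprint and measured_footprint (the second one checks the formula against Python's tarfile module).

```python
import io
import tarfile

BLOCK = 512  # tar stores headers and data in 512-byte blocks


def tar_footprint(file_size: int) -> int:
    """Bytes one regular file occupies inside a plain tar archive."""
    data_blocks = -(-file_size // BLOCK)  # ceiling division
    return BLOCK + data_blocks * BLOCK    # one header block + padded data


def measured_footprint(file_size: int) -> int:
    """Check the formula by writing a single member with the tarfile module."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name="img.png")  # hypothetical member name
        info.size = file_size
        tar.addfile(info, io.BytesIO(b"\0" * file_size))
        # Read the position now, before the end-of-archive blocks are written.
        return buf.tell()


n_files = 170_000
avg_size = 47_000_000 // n_files          # about 276 bytes per image (assumed uniform)
estimate = n_files * tar_footprint(avg_size)
print(f"{estimate / 1e6:.1f} MB")         # prints "174.1 MB"
```

That lands close to the 177.9 MB measured above. The remainder is plausibly files larger than 512 bytes, which need a second data block, plus directory entries. The padding is mostly zero bytes, which is why the compressed tar variants end up so much smaller.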

    Anyway, I think .tar.bz2 might be a good choice: at 29.1 MB it is the second-smallest archive in the table, and it extracts in about 7 seconds.
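For anyone who wants to repeat a comparison like this without dtrx, here is a minimal sketch using only Python's standard library. It is not the procedure used for the table above. It covers only the tar variants, the function names make_dataset and benchmark are made up for illustration, and the test files are random bytes standing in for tiny PNGs, so the resulting sizes will not match the numbers in the table.

```python
import os
import tarfile
import tempfile
import time
from pathlib import Path

# Archive suffix -> tarfile write mode.
MODES = {".tar": "w", ".tar.gz": "w:gz", ".tar.bz2": "w:bz2", ".tar.xz": "w:xz"}


def make_dataset(root: Path, n_files: int, size: int = 300) -> None:
    """Write n_files small files of random bytes (stand-ins for tiny PNGs)."""
    root.mkdir(parents=True, exist_ok=True)
    for i in range(n_files):
        (root / f"{i:06d}.png").write_bytes(os.urandom(size))


def benchmark(src: Path, workdir: Path) -> dict:
    """Return {suffix: (archive_bytes, extract_seconds)} for each tar variant."""
    # Use the safe extraction filter where this Python version supports it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    results = {}
    for suffix, mode in MODES.items():
        archive = workdir / f"dataset{suffix}"
        with tarfile.open(archive, mode) as tar:
            tar.add(src, arcname="dataset")
        out = workdir / f"out{suffix.replace('.', '_')}"
        out.mkdir()
        start = time.perf_counter()
        with tarfile.open(archive, "r:*") as tar:
            tar.extractall(out, **extract_kwargs)
        results[suffix] = (archive.stat().st_size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        make_dataset(tmp_path / "src", n_files=2000)
        for suffix, (size, secs) in benchmark(tmp_path / "src", tmp_path).items():
            print(f"{suffix:10s} {size / 1e6:6.2f} MB  {secs:.2f}s")
```

To test a real dataset, point benchmark at the image directory instead of calling make_dataset. Timings depend heavily on the filesystem and disk, so treat them as relative rather than absolute.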