javascript compression arraybuffer filelist pako

Reconstructing file/folder structure of a decompressed zip file in JS


I am trying to reconstruct the file/folder structure of a decompressed zip file in the browser with JavaScript. Ideally, I'd like to have all files in a FileList (as if they just got uploaded through a web page) or other iterable object. For instance, a compressed folder containing

folder/file1
folder/file2
someotherfile

should be reconstructed to a FileList/iterable object in which each item corresponds to one of the files in the package (to my knowledge, there is no way to retain the folder structure in JS).

I've been quite successful in reading a .tar.gz file and decompressing it using pako with the code at the bottom of this question. However, pako's result is one large ArrayBuffer (the inflator.result in the code below), and I can't make heads or tails of it when trying to reconstruct the original files and folders. I am running into the following issues:

  1. How do I know where one file ends and another one begins in the ArrayBuffer?
  2. How do I determine the original file type of the current file?

Once I know this, I should be able to turn each segment of the ArrayBuffer into a file with

new File([segment], filename, {type: filetype})

Searching the web hasn't turned up anything useful either. Does anyone have any clues on how to approach this problem?

Here is the code that I use to decompress the zipfile.

import pako from 'pako';
import isFunction from 'lodash/isFunction';

class FileStreamer {
  constructor(file, chunkSize = 64 * 1024) {
    this.file = file;
    this.offset = 0;
    this.chunkSize = chunkSize; // bytes
    this.rewind();
  }
  rewind() {
    this.offset = 0;
  }
  isEndOfFile() {
    return this.offset >= this.getFileSize();
  }
  readBlock() {
    const fileReader = new FileReader();
    const blob = this.file.slice(this.offset, this.offset + this.chunkSize);

    return new Promise((resolve, reject) => {
      fileReader.onloadend = (event) => {
        const target = (event.target);
        if (target.error) {
          return reject(target.error);
        }

        this.offset += target.result.byteLength;

        resolve({
          data: target.result,
          progress: Math.min(this.offset / this.file.size, 1)
        });
      };

      fileReader.readAsArrayBuffer(blob);
    });
  }
  getFileSize() {
    return this.file.size;
  }
}

export async function decompress(zipfile, onProgress) {
  const fs = new FileStreamer(zipfile);
  const inflator = new pako.Inflate();
  let block;

  while (!fs.isEndOfFile()) {
    block = await fs.readBlock();
    inflator.push(block.data, fs.isEndOfFile());
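    if (inflator.err) {
      // inflator.err is a numeric code; inflator.msg holds the readable text
      throw new Error(inflator.msg || `pako inflate failed with code ${inflator.err}`);
    }
    if (isFunction(onProgress)) onProgress(block.progress);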
  }

  return inflator.result;
}

Solution

  • A .tar.gz file is a tar file ('Tape ARchive', since bundling files for storage on tape was originally its main purpose) that has subsequently been compressed with gzip. Variants exist for other compressors, such as .tar.bz2 for bzip2.

    Note that this is distinct from the .zip file format originally created for PKZIP, which handles the bundling (tar) and the compressing (gz) in a single step/specification.

    Given this, what you need is another tool to interpret the tar data that pako hands back and turn it into something useful for your purposes. I searched for "tar file reader js" and found js-untar: https://github.com/InvokIT/js-untar

    This appears to take an ArrayBuffer and turn it into a series of file objects, one per entry in the archive. Example code from the project page:

    import untar from "js-untar";
    
    // Load the source ArrayBuffer from a XMLHttpRequest (or any other way you may need).
    var sourceBuffer = [...];
    
    untar(sourceBuffer)
    .progress(function(extractedFile) {
        ... // Do something with a single extracted file.
    })
    .then(function(extractedFiles) {
        ... // Do something with all extracted files.
    });
    
    // or
    
    untar(sourceBuffer).then(
        function(extractedFiles) { // onSuccess
            ... // Do something with all extracted files.
        },
        function(err) { // onError
            ... // Handle the error.
        },
        function(extractedFile) { // onProgress
            ... // Do something with a single extracted file.
        }
    );
    

    That seems like what you need.

    (Please note I can't vouch for the suitability or reliability of this module, as I have never used it, but this should give you a starting point and context to proceed).
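
To address the two numbered questions directly, here is a minimal sketch of how the boundaries can be recovered by hand, assuming the inflated data is a plain POSIX ustar archive. Each entry starts with a 512-byte header that holds the path and the size (as octal text), and the file data that follows is padded to a multiple of 512 bytes. Tar stores no MIME type, so the type has to be guessed, for example from the file extension. The names parseTar, toFile and MIME_BY_EXTENSION are made up for this illustration, and the sketch ignores GNU/pax long-name extensions, links and very large files.

```javascript
const BLOCK = 512; // tar headers and data padding are both 512-byte aligned
const decoder = new TextDecoder();

// Read a NUL-terminated text field from a header block.
function readString(bytes, offset, length) {
  const field = bytes.subarray(offset, offset + length);
  const end = field.indexOf(0);
  return decoder.decode(end === -1 ? field : field.subarray(0, end));
}

// Numeric header fields are octal text padded with spaces or NULs.
function readOctal(bytes, offset, length) {
  const text = readString(bytes, offset, length).trim();
  return text === "" ? 0 : parseInt(text, 8);
}

// Walk the archive header by header and collect regular files.
function parseTar(input) {
  const bytes = input instanceof Uint8Array ? input : new Uint8Array(input);
  const entries = [];
  let offset = 0;
  while (offset + BLOCK <= bytes.length) {
    const header = bytes.subarray(offset, offset + BLOCK);
    if (header.every((b) => b === 0)) break; // an all-zero block ends the archive

    const name = readString(header, 0, 100);
    const size = readOctal(header, 124, 12);
    const typeflag = String.fromCharCode(header[156] || 48); // NUL or "0" = regular file
    // The prefix field is only meaningful for POSIX ustar headers.
    const isUstar = readString(header, 257, 6) === "ustar";
    const prefix = isUstar ? readString(header, 345, 155) : "";

    const dataStart = offset + BLOCK;
    if (typeflag === "0") {
      entries.push({
        path: prefix ? `${prefix}/${name}` : name,
        data: bytes.slice(dataStart, dataStart + size),
      });
    }
    // Skip the data, rounded up to the next 512-byte boundary.
    offset = dataStart + Math.ceil(size / BLOCK) * BLOCK;
  }
  return entries;
}

// Hypothetical lookup table: extend it with whatever types you expect.
const MIME_BY_EXTENSION = {
  txt: "text/plain",
  json: "application/json",
  png: "image/png",
};

// Wrap one entry in a File; the folder part survives only in entry.path.
function toFile(entry) {
  const baseName = entry.path.split("/").pop();
  const ext = baseName.includes(".") ? baseName.split(".").pop().toLowerCase() : "";
  return new File([entry.data], baseName, { type: MIME_BY_EXTENSION[ext] || "" });
}
```

With the decompress function from the question, this would be used roughly as parseTar(await decompress(zipfile)).map(toFile). As far as I know, pako's inflator.result is a Uint8Array rather than a bare ArrayBuffer, which is why parseTar accepts either. The result is an ordinary array rather than a real FileList, but it is iterable, and keeping entry.path alongside each File preserves the folder structure.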