Tags: javascript, amazon-s3, gzip, pako

Extracting gzip data in JavaScript with Pako - encoding issues


I am trying to implement what I expect is a very common use case:

I need to download a gzip file (of complex JSON datasets) from Amazon S3 and decompress (gunzip) it in JavaScript. I have everything working correctly except the final 'inflate' step.

I am using Amazon API Gateway, and have confirmed that the Gateway is properly transferring the compressed file (I used curl and 7-Zip to verify the data coming out of the API). Unfortunately, when I try to inflate the data in JavaScript with Pako, I get errors.
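Roughly, the verification looked like this, with {URLtoS3Gateway} standing in for my endpoint:

curl --output filename.gz {URLtoS3Gateway}
7z t filename.gz        # 7-Zip confirms the archive is intact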

Here is my code (note: response.data is the binary data transferred from AWS):

apigClient.dataGet(params, {}, {})
      .then( (response) => {
        console.log(response);  //shows response including header and data

        const result = pako.inflate(new Uint8Array(response.data), { to: 'string' });
        // ERROR HERE: 'buffer error'  

      }).catch ( (itemGetError) => {
        console.log(itemGetError);
      });

I also tried a version that splits the binary input into an array of character codes, by adding the following before the inflate:

const charData = response.data.split('').map(function(x){return x.charCodeAt(0); });
const binData = new Uint8Array(charData);
const result = pako.inflate(binData, { to: 'string' });
//ERROR: incorrect header check

I suspect I have some sort of encoding issue, and that I am not getting the data into a format where building a Uint8Array from it is meaningful.

Can anyone point me in the right direction to get this working?

For clarity:

  • With the code exactly as listed above, I get a buffer error. If I drop the Uint8Array and just process 'response.data' directly, I get the error 'incorrect header check', which is what makes me suspect that the encoding/format of my data is the issue.
  • The original file was compressed in Java using GZIPOutputStream with UTF-8 and then stored as a static file (e.g. randomname.gz).

  • The file is transferred through the AWS Gateway as binary, so what comes out is byte-identical to the original file: 'curl --output filename.gz {URLtoS3Gateway}' produces the same file as downloading it directly from S3.

  • I had the same basic issue when I used the gateway to encode the binary data as 'base64', but did not pursue that approach very far, as it seemed easier to work with the "real" binary data than to add a base64 encode/decode step in the middle. If that is a needed step, I can add it back in.

I have also tried some of the example processing found halfway through this issue: https://github.com/nodeca/pako/issues/15, but that didn't help (I might be misunderstanding binary format vs. array vs. base64).
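For reference, if the HTTP layer could hand back a genuine ArrayBuffer rather than a string, I believe pako could consume the bytes directly with no conversion; a minimal sketch of what I mean, using fetch against the same gateway URL:

fetch('{URLtoS3Gateway}')
  .then((res) => res.arrayBuffer())
  .then((buf) => {
    // pako accepts a Uint8Array view over the raw response bytes
    const result = pako.inflate(new Uint8Array(buf), { to: 'string' });
    console.log(result);
  })
  .catch((err) => console.log(err));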


Solution

  • I was able to figure out my own problem. It was related to the format of the data as read in by JavaScript (either JavaScript itself or the Angular HttpClient implementation): I was reading it in a "binary" format, but not one that pako recognizes. When I read the data in as base64 and then converted it to binary with 'atob', I got it working. Here is what I actually implemented, starting from fetching the file from S3 storage.

    1) Build an AWS API Gateway that will read a previously stored *.gz file from S3.

    • Create a standard "get" API request to S3 that supports binary. (http://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-console.html)
    • Make sure the Gateway will recognize the input type by setting 'Binary types' (application/gzip worked for me; others like application/octet-stream and image/png should work for other kinds of files besides *.gz). A CLI alternative to the console setting is sketched after this list. NOTE: that setting is under the main API selections list on the left of the API config screen.
    • Set 'Content Handling' to "Convert to text (if needed)" by selecting the API Method/{GET} -> Integration Request box and updating the 'Content Handling' item. (NOTE: the example in the link above recommends "passthrough". DON'T use that, as it will pass the binary through in its unreadable form.) This is the step that actually converts from binary to base64.
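    If you would rather script that binary-type setting than click through the console, the AWS CLI supports it; a sketch, where the REST API id 'abc123' is a placeholder (note the '/' in the MIME type is escaped as '~1' in the patch path):

        aws apigateway update-rest-api \
            --rest-api-id abc123 \
            --patch-operations op='add',path='/binaryMediaTypes/application~1gzip'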

    At this point you should be able to download a base64 version of your binary file via the URL (test it in a browser or with curl).
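    A quick shell sanity check at this stage: pull the base64 body, decode it, and confirm the result is a valid gzip stream (older macOS spells the decode flag -D):

        curl -s {URLtoS3Gateway} | base64 --decode > test.gz
        gzip -t test.gz && echo "valid gzip"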

    2) I then had the API Gateway generate the SDK and used the respective apiGClient.{get} call.

    3) Within the call, translate base64 -> binary string -> Uint8Array, and then decompress/inflate it. My code for that:

        apigClient.myDataGet(params, {}, {})
          .then( (response) => {
            // HttpClient result is in response.data
            // decode the incoming base64 into a binary string
            const strData = atob(response.data);

            // split it into an array of character codes
            const charData = strData.split('').map(function(x){return x.charCodeAt(0); });

            // build a byte array from the character codes
            const binData = new Uint8Array(charData);

            // inflate
            const result = pako.inflate(binData, { to: 'string' });
            console.log(result);
          }).catch( (itemGetError) => {
            console.log(itemGetError);
          });
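    As an aside, the atob/split/map chain can be collapsed with Uint8Array.from, which performs the same byte-for-byte conversion in one step:

        const binData = Uint8Array.from(atob(response.data), (c) => c.charCodeAt(0));
        const result = pako.inflate(binData, { to: 'string' });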