javascript, node.js, axios, blob, arraybuffer

How does axios handle blob vs. arraybuffer as responseType?


I'm downloading a zip file with axios. For further processing, I need to get the "raw" data that has been downloaded. As far as I can see, JavaScript offers two types for this: Blob and ArrayBuffer. Both can be specified as responseType in the request options.

In a next step, the zip file needs to be uncompressed. I've tried two libraries for this: js-zip and adm-zip. Both want the data to be an ArrayBuffer. So far so good: I can convert the blob to a buffer, and after this conversion adm-zip always happily extracts the zip file. However, js-zip complains about a corrupted file unless the zip has been downloaded with 'arraybuffer' as the axios responseType. js-zip does not work on a buffer that has been taken from a blob.

This was very confusing to me. I thought both ArrayBuffer and Blob were essentially just views on the underlying memory. There might be a difference in performance between downloading something as a blob vs. a buffer, but the resulting data should be the same, right?

Well, I decided to experiment and found this:

If you specify responseType: 'blob', axios converts the response.data to a string. Let's say you hash this string and get hashcode A. Then you convert it to a buffer. For this conversion, you need to specify an encoding. Depending on the encoding, you will get a variety of new hashes, let's call them B1, B2, B3, ... When specifying 'utf8' as the encoding, I get back to the original hash A.

So I guess when downloading data as a 'blob', axios implicitly converts it to a string encoded with utf8. This seems very reasonable.

Now you specify responseType: 'arraybuffer'. Axios provides you with a buffer as response.data. Hash the buffer and you get a hashcode C. This code does not correspond to any code in A, B1, B2, ...

So when downloading data as an 'arraybuffer', you get entirely different data?

It now makes sense to me that the unzipping library js-zip complains if the data is downloaded as a 'blob'. It probably actually is corrupted somehow. But then how is adm-zip able to extract it? And I checked the extracted data, it is correct. This might only be the case for this specific zip archive, but nevertheless surprises me.

Here is the sample code I used for my experiments:

//typescript import syntax, this is executed in nodejs
import axios from 'axios';
import * as crypto from 'crypto';

axios.get(
    "http://localhost:5000/folder.zip", //hosted with serve
    { responseType: 'blob' }) // replace this with 'arraybuffer' and response.data will be a buffer
    .then((response) => {
        console.log(typeof (response.data));

        // first hash the response itself
        console.log(crypto.createHash('md5').update(response.data).digest('hex'));

        // then convert to a buffer and hash again
        // replace 'binary' with any valid encoding name
        let buffer = Buffer.from(response.data, 'binary');
        console.log(crypto.createHash('md5').update(buffer).digest('hex'));
        //...
    });

What creates the difference here, and how do I get the 'true' downloaded data?


Solution

  • From axios docs:

    // `responseType` indicates the type of data that the server will respond with
    // options are: 'arraybuffer', 'document', 'json', 'text', 'stream'
    //   browser only: 'blob'
    responseType: 'json', // default
    

    'blob' is a "browser only" option.

    So from node.js, when you set responseType: "blob", "json" will actually be used, which I guess falls back to "text" when no parseable JSON data has been fetched.

    Fetching binary data as text is prone to corrupting it. The text returned by Body.text() and many other APIs is a USVString (which does not allow unpaired surrogate code points), and the response is decoded as UTF-8. Any byte sequence in the binary file that is not valid UTF-8 is therefore replaced by the � (U+FFFD) replacement character, with no way to get back what that data was before: your data is corrupted.

    Here is a snippet explaining this, using the header of a .png file 0x89 0x50 0x4E 0x47 as an example.

    (async () => {
    
      const url = 'https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png';
      // fetch as binary
      const buffer = await fetch( url ).then(resp => resp.arrayBuffer());
    
      const header = new Uint8Array( buffer ).slice( 0, 4 );
      console.log( 'binary header', header ); // [ 137, 80, 78, 71 ]
      console.log( 'entity encoded', entityEncode( header ) );
      // [ "U+0089", "U+0050", "U+004E", "U+0047" ]
      // You can read more about the (U+0089) control character here
      // https://www.fileformat.info/info/unicode/char/0089/index.htm
      // You can see in the left table how this character in UTF-8 needs two bytes (0xC2 0x89)
      // A lone 0x89 byte is thus not a valid UTF-8 sequence:
      // the decoder discards it and emits the replacement character instead
      
      // read as UTF-8 
      const utf8_str = await new Blob( [ header ] ).text();
      console.log( 'read as UTF-8', utf8_str ); // "�PNG"
      // build back a binary array from that string
      const utf8_binary = [ ...utf8_str ].map( char => char.charCodeAt( 0 ) );
      console.log( 'Which is binary', utf8_binary ); // [ 65533, 80, 78, 71 ]
      console.log( 'entity encoded', entityEncode( utf8_binary ) );
      // [ "U+FFFD", "U+0050", "U+004E", "U+0047" ]
      // You can read more about character � (U+FFFD) here
      // https://www.fileformat.info/info/unicode/char/0fffd/index.htm
      //
      // P (U+0050), N (U+004E) and G (U+0047) are ASCII characters:
      // each is a single byte in UTF-8 that maps to the same code point, so nothing is lost for these
      // (that's how base64 encoding, which only uses ASCII characters, makes it possible to send binary data as text)
      
      // now let's see what fetching as text holds
      const fetched_as_text = await fetch( url ).then( resp => resp.text() );
      const header_as_text = fetched_as_text.slice( 0, 4 );
      console.log( 'fetched as "text"', header_as_text ); // "�PNG"
      const as_text_binary = [ ...header_as_text ].map( char => char.charCodeAt( 0 ) );
      console.log( 'Which is binary', as_text_binary ); // [ 65533, 80, 78, 71 ]
      console.log( 'entity encoded', entityEncode( as_text_binary ) );
      // [ "U+FFFD", "U+0050", "U+004E", "U+0047" ]
      // It's been read as UTF-8, we lost the first byte.
      
    })();
    
    function entityEncode( arr ) {
      return Array.from( arr ).map( val => 'U+' + toHex( val ) );
    }
    function toHex( num ) {
      return num.toString( 16 ).padStart(4, '0').toUpperCase();
    }
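
The same loss can be reproduced in Node.js without any network request. The following is only a minimal sketch, assuming a plain Node.js runtime; the variable names are made up for illustration. It decodes the four PNG signature bytes as UTF-8, re-encodes the resulting string in two ways, and contrasts that with a base64 round trip, which is reversible.

```javascript
// Node.js sketch: lossy UTF-8 round trip on the PNG signature bytes.
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47]);

// Decoding as UTF-8: 0x89 is a continuation byte with no lead byte,
// so the decoder substitutes U+FFFD for it.
const asText = original.toString('utf8');

// Re-encoding the string cannot restore the 0x89 byte.
const viaUtf8 = Buffer.from(asText, 'utf8');     // U+FFFD is encoded as EF BF BD
const viaLatin1 = Buffer.from(asText, 'latin1'); // only the low byte (FD) is kept

// base64 only uses ASCII characters, so this round trip is reversible.
const viaBase64 = Buffer.from(original.toString('base64'), 'base64');

console.log(original.toString('hex'));   // 89504e47
console.log(viaUtf8.toString('hex'));    // efbfbd504e47
console.log(viaLatin1.toString('hex'));  // fd504e47
console.log(viaBase64.equals(original)); // true
```

Whatever encoding is used to turn the string back into bytes, the original 0x89 is gone.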


    Node.js historically had no native Blob object, so it makes sense that axios didn't monkey-patch one in just to return a response that nothing else would be able to consume anyway.
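
This probably also accounts for the hashes observed in the question. When a string is passed to `hash.update()` without an input encoding, Node.js encodes it as UTF-8, so hashing the string (A) and hashing a buffer built from it with 'utf8' give the same digest, while neither matches the hash of the real bytes (C). A rough sketch, where the byte values are an arbitrary stand-in for the downloaded zip and the `md5` helper is made up for illustration:

```javascript
const crypto = require('crypto');

// Hypothetical helper: hex MD5 digest of a string or Buffer.
const md5 = (data) => crypto.createHash('md5').update(data).digest('hex');

// Stand-in for the real download: a zip signature plus a few non-ASCII bytes.
const trueBytes = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0x89, 0xff, 0x00, 0x80]);

// What a text-based responseType hands back: the bytes decoded as UTF-8.
const asString = trueBytes.toString('utf8');

const hashA = md5(asString);                             // A: hash of the string
const hashUtf8 = md5(Buffer.from(asString, 'utf8'));     // B with 'utf8'
const hashBinary = md5(Buffer.from(asString, 'binary')); // B with 'binary'
const hashC = md5(trueBytes);                            // C: hash of the real bytes

console.log(hashA === hashUtf8);   // true
console.log(hashA === hashC);      // false
console.log(hashBinary === hashC); // false
```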

    From a browser, you'd get exactly the same results: 'json' and 'text' corrupt the binary data, while 'arraybuffer' and 'blob' preserve it:

    function fetchAs( type ) {
      return axios( {
        method: 'get',
        url: 'https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png',
        responseType: type
      } );
    }
    
    function loadImage( data, type ) {
      // we can pass all of them to the Blob constructor directly
      const new_blob = new Blob( [ data ], { type: 'image/png' } );
      // with blob: URI, the browser will try to load 'data' as-is
      const url = URL.createObjectURL( new_blob );
      
      const img = document.getElementById( type + '_img' );
      img.src = url;
      return new Promise( (res, rej) => { 
        img.onload = e => res(img);
        img.onerror = rej;
      } );
    }
    
    [
      'json', // will fail
      'text', // will fail
      'arraybuffer',
      'blob'
    ].forEach( type =>
      fetchAs( type )
       .then( resp => loadImage( resp.data, type ) )
       .then( img => console.log( type, 'loaded' ) )
       .catch( err => console.error( type, 'failed' ) )
    );
    <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
    
    <figure>
      <figcaption>json</figcaption>
      <img id="json_img">
    </figure>
    <figure>
      <figcaption>text</figcaption>
      <img id="text_img">
    </figure>
    <figure>
      <figcaption>arraybuffer</figcaption>
      <img id="arraybuffer_img">
    </figure>
    <figure>
      <figcaption>blob</figcaption>
      <img id="blob_img">
    </figure>
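
As for how to get the 'true' downloaded data in Node.js: request it with responseType 'arraybuffer'. A minimal sketch, assuming axios is installed and passed in as `httpClient`; the helper name `downloadAsBuffer` and its parameters are made up for illustration and are not part of the axios API.

```javascript
// `httpClient` is assumed to be axios (e.g. require('axios')); it is passed in
// as a parameter only to keep this sketch self-contained.
async function downloadAsBuffer(httpClient, url) {
  // 'arraybuffer' keeps the response as raw bytes, with no text decoding step.
  const response = await httpClient.get(url, { responseType: 'arraybuffer' });
  // Buffer.from accepts both a Buffer and an ArrayBuffer.
  return Buffer.from(response.data);
}
```

It could be called as `downloadAsBuffer(axios, 'http://localhost:5000/folder.zip')`, and the resolved Buffer can then be handed to the unzip library.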