
Upload binary data to AppEngine Blobstore via HTTP request


I'm trying to figure out the lowest data-overhead way to upload/download binary data to Google AppEngine's Blobstore from a JavaScript-initiated HTTP request. Ideally, I would like to submit the binary data directly, i.e. as unencoded 8-bit values, maybe in a POST request that looks something like this:

...
Content-Type: multipart/form-data; boundary=boundary;

--boundary
Content-Disposition: form-data; name="a"; filename="b"
Content-Type: application/octet-stream

@#^%(^Qtr...
--boundary--

Here, @#^%(^Qtr... ideally represents arbitrary 8-bit binary data.

Specifically, I am trying to understand the following:

  • Is it possible to directly upload 8-bit binary data, or would I need to encode the data somehow, like a base-64 MIME encoding?
  • If I use a different encoding, would Blobstore save the data as 8-bit binary internally or in the encoded format? I.e. would a base-64 encoding increase my storage cost by 33%?
  • Along the same lines: Does encoding overhead increase outgoing bandwidth cost?
  • Is there a better way to format the POST request so I don't need to come up with a boundary that doesn't appear in my binary data? E.g. is there a way to specify a Content-Length rather than a boundary?
  • In the GET request to retrieve the data, can I simply expect to have binary data end up in the return string, or is the server going to automatically encode the data somehow?
  • If I need to use some encoding, which one would be the best choice among the supported options for essentially random 8-bit data? (base-64, UTF-8, something else?)

Solution

  • Even though I received the Tumbleweed Badge for this question, let me report on my progress anyway, in case somebody out there does care:

    This question turned out to pose 3 independent problems:

    1. Uploading data to BlobStore efficiently
    2. Making sure BlobStore saves it in the smallest possible format
    3. Finding a way to reliably download the data

    Let's start with (3), because this ends up posing the biggest issue:

    So far I have not been able to find a way to download true 8-bit data to the browser via XHR. Using MIME types like application/octet-stream leads to only 7 bits reaching the client reliably, unless the data is downloaded to a file. The best solution I found is to serve the data with the following MIME type:

    text/plain; charset=ISO-8859-1
    

    This seems to be supported in all browsers that I've tested: IE 8, Chrome 21, FF 12.0, Opera 11.61, Safari 5.1.2 under Windows, and Android 2.3.3.

    With this, it is possible to transfer almost any 8-bit value, with the following restrictions/caveats:

    • Character 0x00 is interpreted as the end of the input string in IE8 and must therefore be avoided.
    • Most browsers interpret charset ISO-8859-1 as Windows-1252 instead, leading to characters 0x80 through 0x9F being changed accordingly. This can be fixed, though, as the changes are unambiguous. (see http://en.wikipedia.org/wiki/Windows-1252#Codepage_layout)
    • Characters 0x81, 0x8D, 0x8F, 0x90, and 0x9D are undefined in the Windows-1252 charset, and Opera returns an error code for these, so they need to be avoided as well.
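
The Windows-1252 fix-up mentioned in the list above could look roughly like the following sketch. It maps each character of `responseText` back to the byte that was sent. The names `CP1252_HIGH`, `responseTextToBytes`, and `downloadBytes` are made up for illustration, and the download URL is whatever your own handler serves; the table is the standard Windows-1252 layout for 0x80 through 0x9F.

```javascript
// Windows-1252 code points for bytes 0x80..0x9F.
// 0 marks the five bytes that are undefined in Windows-1252.
var CP1252_HIGH = [
  0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
  0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
];

// Reverse lookup: code point -> original byte value.
var CP1252_TO_BYTE = {};
for (var i = 0; i < CP1252_HIGH.length; i++) {
  if (CP1252_HIGH[i] !== 0) {
    CP1252_TO_BYTE[CP1252_HIGH[i]] = 0x80 + i;
  }
}

// Convert the text received via XHR back into an array of byte values.
function responseTextToBytes(text) {
  var bytes = new Array(text.length);
  for (var k = 0; k < text.length; k++) {
    var code = text.charCodeAt(k);
    if (code < 0x100) {
      // Plain Latin-1 range: the character code is the byte itself.
      bytes[k] = code;
    } else if (CP1252_TO_BYTE.hasOwnProperty(code)) {
      // The browser decoded the byte as Windows-1252: undo the remapping.
      bytes[k] = CP1252_TO_BYTE[code];
    } else {
      throw new Error("Unexpected character U+" + code.toString(16) + " at index " + k);
    }
  }
  return bytes;
}

// Hypothetical download helper; assumes the server answers with
// Content-Type: text/plain; charset=ISO-8859-1.
function downloadBytes(url, onBytes) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
      onBytes(responseTextToBytes(xhr.responseText));
    }
  };
  xhr.send(null);
}
```

Because every defined Windows-1252 code point in that range maps to exactly one byte, the reverse lookup is unambiguous, and browsers that really do decode as ISO-8859-1 fall through the first branch unchanged.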

    Overall, this leaves us with 250 out of the 256 characters which we can use. Re-encoding the data from base 256 to base 250 costs a factor of log 256 / log 250 ≈ 1.0043, i.e. an outgoing-data overhead of under 0.5%, which I guess I'm ok with.
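
One possible way to do such a basis change is sketched below: treat a block of bytes as one big number and write it out again in base 250, using only the allowed byte values as digits. This is just an illustration, not necessarily the scheme I ended up with. It assumes `BigInt` support, which the browsers listed above do not have, and the names `FORBIDDEN`, `ALPHABET`, `digitsNeeded`, `encodeBlock`, and `decodeBlock` are made up for this example.

```javascript
// The six byte values that must be avoided.
var FORBIDDEN = [0x00, 0x81, 0x8D, 0x8F, 0x90, 0x9D];

// The 250 byte values that are safe to transfer, in ascending order.
var ALPHABET = [];
for (var b = 0; b < 256; b++) {
  if (FORBIDDEN.indexOf(b) === -1) ALPHABET.push(b);
}

// Smallest number of base-250 digits that can hold n arbitrary bytes.
function digitsNeeded(n) {
  var limit = 256n ** BigInt(n);
  var capacity = 1n;
  var digits = 0;
  while (capacity < limit) {
    capacity *= 250n;
    digits++;
  }
  return digits;
}

// Encode a block of bytes into allowed byte values only.
function encodeBlock(bytes) {
  var value = 0n;
  for (var i = 0; i < bytes.length; i++) {
    value = value * 256n + BigInt(bytes[i]);
  }
  var m = digitsNeeded(bytes.length);
  var out = new Array(m);
  for (var j = m - 1; j >= 0; j--) {
    out[j] = ALPHABET[Number(value % 250n)];
    value /= 250n;
  }
  return out;
}

// Decode a block back into n original bytes (n must be known to the receiver).
function decodeBlock(safeBytes, n) {
  var value = 0n;
  for (var i = 0; i < safeBytes.length; i++) {
    value = value * 250n + BigInt(ALPHABET.indexOf(safeBytes[i]));
  }
  var out = new Array(n);
  for (var j = n - 1; j >= 0; j--) {
    out[j] = Number(value % 256n);
    value /= 256n;
  }
  return out;
}

// Theoretical size factor of the basis change, roughly 1.0043.
var sizeFactor = Math.log(256) / Math.log(250);
```

Small blocks waste a lot (a single byte needs two digits), so the block size has to be large for the real overhead to approach the theoretical factor.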

    So, now to problem (1) and (2):

    As incoming bandwidth is free, I've decided to reduce the priority of solving problem (1) in favor of problems (2) and (3). Turns out, using the following POST request does the trick then:

    ...
    Content-Type: multipart/form-data; boundary=-
    
    ---
    Content-Disposition: form-data; name="a"; filename="b"
    Content-Type: text/plain; charset=ISO-8859-1
    Content-Transfer-Encoding: base64
    
    abcd==
    -----
    

    Here, abcd== is the base64-MIME-encoded data consisting of the 250 allowed characters described above (see http://en.wikipedia.org/wiki/Base64#Examples; GAE uses + and / as the last 2 characters). The encoding is necessary (correct me if I'm wrong) because calling the XHR send() function with String data results in the string being UTF-8 encoded, which screws up the data received by the server. Passing ArrayBuffers or Blobs to send() would circumvent this issue more elegantly, but unfortunately that isn't available in all browsers yet.
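
For illustration, the request above could be assembled along these lines. This is a rough sketch: `bytesToBase64`, `buildMultipartBody`, and `uploadBlob` are made-up names, and `uploadUrl` stands for whatever upload URL your server hands out. The base64 encoder is written by hand on the assumption that `btoa` may be missing in older browsers. Since the base64 alphabet never contains `-`, a boundary of `-` cannot collide with the payload.

```javascript
var B64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Base64-encode an array of byte values (0..255), with "=" padding.
function bytesToBase64(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i += 3) {
    var has1 = i + 1 < bytes.length;
    var has2 = i + 2 < bytes.length;
    var b0 = bytes[i];
    var b1 = has1 ? bytes[i + 1] : 0;
    var b2 = has2 ? bytes[i + 2] : 0;
    out += B64_CHARS.charAt(b0 >> 2);
    out += B64_CHARS.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    out += has1 ? B64_CHARS.charAt(((b1 & 0x0F) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64_CHARS.charAt(b2 & 0x3F) : "=";
  }
  return out;
}

// Build the multipart/form-data body shown above, with CRLF line endings.
function buildMultipartBody(fieldName, fileName, base64Data, boundary) {
  var CRLF = "\r\n";
  return "--" + boundary + CRLF +
    'Content-Disposition: form-data; name="' + fieldName + '"; filename="' + fileName + '"' + CRLF +
    "Content-Type: text/plain; charset=ISO-8859-1" + CRLF +
    "Content-Transfer-Encoding: base64" + CRLF +
    CRLF +
    base64Data + CRLF +
    "--" + boundary + "--" + CRLF;
}

// Hypothetical upload helper; the body is pure ASCII, so the UTF-8
// encoding applied by send() leaves it unchanged.
function uploadBlob(uploadUrl, bytes, onDone) {
  var boundary = "-";
  var xhr = new XMLHttpRequest();
  xhr.open("POST", uploadUrl, true);
  xhr.setRequestHeader("Content-Type", "multipart/form-data; boundary=" + boundary);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) onDone(xhr.status);
  };
  xhr.send(buildMultipartBody("a", "b", bytesToBase64(bytes), boundary));
}
```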

    Now the good news: The AppEngine BlobStore decodes this data automatically and correctly and stores it without overhead! Therefore, using the base64 encoding only leads to slower data uploads from the client, but does not result in additional hosting cost (except maybe for a couple of CPU cycles for the decoding).

    Aside: The AppEngine development-server will report the encoded size (i.e. 33% larger) for the stored blob, both in the admin console and in a retrieved BlobInfo record. The production servers do not have this issue, though, and report the correct blob size.

    Conclusion:

    Uploading binary data with Content-Transfer-Encoding base64 and Content-Type text/plain; charset=ISO-8859-1, while avoiding the characters 0x00, 0x81, 0x8D, 0x8F, 0x90, and 0x9D, leads to reliable data transfer in the tested browsers with a storage/outgoing-bandwidth overhead of less than half a percent. The upload overhead of the base64-encoded data is 33%, which is better than the expected 50% for UTF-8 (for random 8-bit data), but still far from desirable.
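
The two upload figures can be checked with a few lines of arithmetic. The sketch below assumes each byte value is sent as the character with the same code point, so values below 0x80 take one byte in UTF-8 and all others take two; the function names are made up for illustration.

```javascript
// Size in bytes after UTF-8 encoding, if every byte value is sent as the
// character with the same code point (0x00..0x7F -> 1 byte, 0x80..0xFF -> 2).
function utf8EncodedLength(bytes) {
  var total = 0;
  for (var i = 0; i < bytes.length; i++) {
    total += bytes[i] < 0x80 ? 1 : 2;
  }
  return total;
}

// Size in characters after base64 encoding with padding.
function base64EncodedLength(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Use every byte value exactly once as a stand-in for uniformly random data.
var allBytes = [];
for (var b = 0; b < 256; b++) allBytes.push(b);

var utf8Overhead = utf8EncodedLength(allBytes) / allBytes.length - 1;      // 384 / 256 - 1 = 0.5
var base64Overhead = base64EncodedLength(allBytes.length) / allBytes.length - 1; // 344 / 256 - 1 = 0.34375
```

The base64 figure is slightly above one third here only because of padding on a 256-byte input; for long inputs it approaches 33%.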

    What I don't know is: Is this the optimal solution, or could one do better? Anyone up for the challenge?