I'm trying to figure out the lowest data-overhead way to upload/download binary data to Google AppEngine's Blobstore from a JavaScript-initiated HTTP request. Ideally, I would like to submit the binary data directly, i.e. as unencoded 8-bit values, maybe in a POST request that looks something like this:
...
Content-Type: multipart/form-data; boundary=boundary;

--boundary
Content-Disposition: form-data; name="a"; filename="b"
Content-Type: application/octet-stream

@#^%(^Qtr...
--boundary--
Here, @#^%(^Qtr... ideally represents arbitrary 8-bit binary data.
Specifically, I am trying to understand the following:
boundary that doesn't appear in my binary data? E.g. is there a way to specify a Content-Length rather than a boundary?
Even though I received the Tumbleweed Badge for this question, let me report on my progress anyways in case somebody out there does care:
This question turned out to pose 3 independent problems:
Let's start with (3), because this ends up posing the biggest issue:
So far I have not been able to find a way to download true 8-bit data to the browser via XHR. Using MIME types like application/octet-stream leads to only 7 bits reaching the client reliably, unless the data is downloaded to a file. The best solution I found is to use the following MIME type for the data:
text/plain; charset=ISO-8859-1
This seems to be supported in all browsers that I've tested: IE 8, Chrome 21, FF 12.0, Opera 11.61, Safari 5.1.2 under Windows, and Android 2.3.3.
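As a side check, here is a minimal sketch that lists which byte values have no character assigned in a given single-byte encoding. Python is used only as a neutral way to inspect the tables, and `undefined_in` is a hypothetical helper name. The assumption (not stated in this post) is that browsers handle the ISO-8859-1 label like windows-1252, which may be relevant to the restrictions described next.

```python
def undefined_in(codec):
    """Return the byte values (0..255) that `codec` cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Strict ISO-8859-1 assigns a character to every byte value.
latin1_gaps = undefined_in("latin-1")

# windows-1252 (Python codec name "cp1252") leaves a few bytes unassigned.
cp1252_gaps = undefined_in("cp1252")
```

Python's own tables are not proof of what any particular browser does, so treat this as a hint about where problematic bytes might come from rather than as an explanation.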
With this, it is possible to transfer almost any 8-bit value, with the following restrictions/caveats:
Overall, this leaves us with 250 out of the 256 byte values which we can use. Re-encoding the data from base 256 to base 250 means an outgoing-data overhead of under 0.5% (log 256 / log 250 ≈ 1.0043), which I guess I'm ok with.
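To make the base change concrete, here is a rough sketch in Python (chosen only for illustration; the real client side would be JavaScript). The excluded byte values are the six named in the conclusion of this post; the names `FORBIDDEN`, `ALPHABET`, `overhead`, `encode` and `decode` are made up for this example, and the big-integer approach is just the simplest one to read, not what I'd use on large blobs.

```python
import math

# Byte values reported as unusable with text/plain; charset=ISO-8859-1.
FORBIDDEN = {0x00, 0x81, 0x8D, 0x8F, 0x90, 0x9D}
ALPHABET = bytes(b for b in range(256) if b not in FORBIDDEN)  # 250 symbols
INDEX = {sym: i for i, sym in enumerate(ALPHABET)}


def overhead(base):
    """Relative size increase when 8-bit data is re-expressed in `base` symbols."""
    return math.log(256) / math.log(base) - 1.0


def encode(data):
    """Re-express `data` in base 250 using only allowed byte values."""
    # The 0x01 sentinel byte preserves leading zero bytes of the input.
    n = int.from_bytes(b"\x01" + data, "big")
    out = bytearray()
    while n:
        n, digit = divmod(n, len(ALPHABET))
        out.append(ALPHABET[digit])
    return bytes(reversed(out))


def decode(text):
    """Inverse of encode(): base-250 symbols back to the original bytes."""
    n = 0
    for sym in text:
        n = n * len(ALPHABET) + INDEX[sym]
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte
```

Converting the whole input as one big integer is quadratic in the input size, so a practical version would presumably convert fixed-size blocks instead, at the cost of a slightly higher overhead.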
So, now to problem (1) and (2):
As incoming bandwidth is free, I've decided to reduce the priority of solving problem (1) in favor of problems (2) and (3). Turns out, using the following POST request does the trick then:
...
Content-Type: multipart/form-data; boundary=-

---
Content-Disposition: form-data; name="a"; filename="b"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: base64

abcd==
-----
Here, abcd== is the base64-MIME-encoded data consisting of the above described 250 allowed characters (see http://en.wikipedia.org/wiki/Base64#Examples, GAE uses + and / as the last 2 characters). The encoding is necessary (correct me if I'm wrong) because calling the XHR send() function with String data results in UTF-8 encoding of the string, which corrupts the data received by the server. Unfortunately, passing ArrayBuffers and Blobs to send(), which would circumvent this issue more elegantly, isn't available in all browsers yet.
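For reference, the request body above can be assembled byte for byte as in the following sketch. It is written in Python purely to show the wire format (the actual request would be built in JavaScript and passed to send()); `build_multipart_body` is a hypothetical helper, and the field name "a" and filename "b" are just the placeholders from the request shown above.

```python
import base64


def build_multipart_body(payload, boundary="-", name="a", filename="b"):
    """Build a single-part multipart/form-data body with a base64 payload."""
    crlf = "\r\n"
    head = crlf.join([
        "--" + boundary,  # part delimiter: "--" followed by the boundary
        'Content-Disposition: form-data; name="%s"; filename="%s"' % (name, filename),
        "Content-Type: text/plain; charset=ISO-8859-1",
        "Content-Transfer-Encoding: base64",
        "",  # empty line separates the part headers from the part body
        "",
    ])
    encoded = base64.b64encode(payload).decode("ascii")
    # Closing delimiter: "--" + boundary + "--"
    tail = crlf + "--" + boundary + "--" + crlf
    return (head + encoded + tail).encode("ascii")
```

A nice side effect of base64: its standard alphabet (A-Z, a-z, 0-9, +, / and = for padding) never contains "-", so the one-character boundary "-" cannot collide with the encoded payload.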
Now the good news: The AppEngine BlobStore decodes this data automatically and correctly and stores it without overhead! Therefore, using the base64 encoding only leads to slower data uploads from the client, but does not result in additional hosting cost (except maybe for a couple of CPU cycles for the decoding).
Aside: The AppEngine development-server will report the encoded size (i.e. 33% larger) for the stored blob, both in the admin console and in a retrieved BlobInfo record. The production servers do not have this issue, though, and report the correct blob size.
Conclusion:
Using Content-Transfer-Encoding base64 for uploading binary data of Content-Type text/plain; charset=ISO-8859-1, which may not contain characters 0x00, 0x81, 0x8D, 0x8F, 0x90, and 0x9D, leads to reliable data transfer for many tested browsers with a storage/outgoing-bandwidth overhead of less than half a percent. The upload-overhead of the base64-encoded data is 33%, which is better than the expected 50% for UTF-8 (for random 8-bit data), but still far from desirable.
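The 33% and 50% figures can be checked with a few lines. This is only a sketch in Python with a made-up helper name (`upload_overheads`); it models the UTF-8 case by treating each byte as the code point of the same value, which is my reading of what send() does with a String.

```python
import base64


def upload_overheads(data):
    """Return (utf8_overhead, base64_overhead) as fractions of len(data)."""
    # Treat every byte as the code point with the same value, then UTF-8
    # encode: values 0x00-0x7F take one byte, 0x80-0xFF take two bytes.
    as_text = data.decode("latin-1")
    utf8_size = len(as_text.encode("utf-8"))
    # base64 emits 4 output characters for every 3 input bytes.
    base64_size = len(base64.b64encode(data))
    return utf8_size / len(data) - 1, base64_size / len(data) - 1


# Every byte value equally often, as a stand-in for random 8-bit data.
sample = bytes(range(256)) * 3
utf8_overhead, base64_overhead = upload_overheads(sample)
```

For data that is mostly 7-bit, the UTF-8 route would of course come out cheaper than base64; the comparison only favors base64 for data spread across all 8 bits.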
What I don't know is: Is this the optimal solution, or could one do better? Anyone up for the challenge?