google-apps-script google-api google-drive-api mime-types

How to process response body of workspace docs export in HTML (mimetype application/zip) from Batch Request of Google Drive API

I am able to export 100 Google Workspace docs, as plain text, to my server with each Google Drive batch request. I have followed Kanshi Tanaike’s excellent examples in "Efficient File Management using Batch Requests with Google Apps Script".

However, for Google docs export in HTML (rather than plain text), I do not know how to process the response body. (BTW, the mime type is application/zip for HTML export).

  I am hoping someone can provide some basic information about how to process the response body. Or perhaps an example in Google Apps Script or Elixir which I could then follow?

I have tried splitting the response body on the batch marker into the 100 responses (as I do in the plain text case). I am left with what might be one or 100 zip files, but everything I have tried to unzip them has given an error. I presume that I am splitting the response.body incorrectly. I am not an experienced programmer and have no experience in working with zip files. I even tried opening the split response.body with unzip utilities, without success.

Note that I can handle the response.body of the export of a single workspace doc, in HTML, not within a Batch request, using:

url = "https://www.googleapis.com/drive/v3/files/#{doc_id}/export?mimeType=application/zip"

This is because the response is very clearly a set of tuples of either {item_name, text_string} (for the HTML part of the document, which can be processed directly) or {item_name, byte_sequence} (for, e.g., images within the doc).

Actually, I am at the moment only interested in the HTML part rather than images, even in the batch export (is there a way to only export the HTML in a batch request?).

Solution

I believe your goal is as follows.

You want to export multiple Google Documents as application/zip with the batch requests using Google Apps Script.

Modification points:

In the current stage, the response value from the batch request is as follows.
```
  --batch_###
  Content-Type: application/http
  Content-ID: response-1

  HTTP/1.1 200 OK
  Content-Disposition: attachment
  Content-Type: application/zip
  Date: ###
  Expires: ###
  Cache-Control: private, max-age=0
  Content-Length: 1000

  ### data ###
  --batch_###--
```
- I think that ### data ### is the zip file as binary data. When the response is retrieved with res.getContentText(), it is converted to a string, and this conversion breaks the binary data. I think that this is the reason for your current issue of "everything I have tried to unzip them has given an error".
- In order to decode the retrieved data correctly, the response must be processed at the binary level, that is, as a byte array.
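
To illustrate why the string conversion breaks the data, here is a minimal sketch in plain JavaScript (for Node.js, not Apps Script). It assumes that the standard TextDecoder and TextEncoder behave similarly to the bytes-to-string-to-bytes round trip that happens when binary data is read with getContentText(). The six sample bytes are made up for illustration.

```javascript
// Minimal sketch (plain JavaScript for Node.js, not Apps Script).
// Assumption: TextDecoder/TextEncoder stand in for the bytes -> string -> bytes
// round trip that happens when binary data is read as text.

// The first 4 bytes are the zip local file header signature "PK\x03\x04".
// The last 2 bytes are arbitrary values that are not valid UTF-8.
const original = Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe]);

// Decode as UTF-8 text: each invalid byte becomes U+FFFD (replacement character).
const asText = new TextDecoder("utf-8").decode(original);

// Encode the text back to bytes: each U+FFFD is written as 3 bytes (EF BF BD).
const roundTripped = new TextEncoder().encode(asText);

console.log(original.length);     // 6
console.log(roundTripped.length); // 10
```

The byte count changes from 6 to 10, so the original bytes can no longer be recovered from the string. This is why the sample script below reads the response with res.getContent() instead.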

In this answer, I would like to propose a simple sample script for decoding the response data from the batch request. In this sample, Google Document files are exported as application/zip.

Sample script:

Please copy and paste the following script to the script editor of the Google Apps Script project, and please set your folder ID and document IDs.

And, please enable Drive API at Advanced Google services.

/**
 * Ref: https://tanaikech.github.io/2023/03/08/split-binary-data-with-search-data-using-google-apps-script/
 * Split byteArray by a search data.
 * @param {Array} baseData Input byteArray of base data.
 * @param {Array} searchData Input byteArray of search data using split.
 * @return {Array} An array including byteArray.
 */
function splitByteArrayBySearchData_(baseData, searchData) {
  if (!Array.isArray(baseData) || !Array.isArray(searchData)) {
    throw new Error("Please give byte array.");
  }
  const search = searchData.join("");
  const bLen = searchData.length;
  const res = [];
  let idx = 0;
  do {
    idx = baseData.findIndex((_, i, a) => [...Array(bLen)].map((_, j) => a[j + i]).join("") == search);
    if (idx != -1) {
      res.push(baseData.splice(0, idx));
      baseData.splice(0, bLen);
    } else {
      res.push(baseData.splice(0));
    }
  } while (idx != -1);
  return res;
}

/**
 * Ref: https://cloud.google.com/blog/topics/developers-practitioners/efficient-file-management-using-batch-requests-google-apps-script
 * Create a request body of batch requests and request it.
 * 
 * @param {Object} object Object for creating request body of batch requests.
 * @returns {Object} UrlFetchApp.HTTPResponse
 */
function batchRequests_(object) {
  const { batchPath, requests } = object;
  const boundary = "sampleBoundary12345";
  const lb = "\r\n";
  const payload = requests.reduce((r, e, i, a) => {
    r += `Content-Type: application/http${lb}`;
    r += `Content-ID: ${i + 1}${lb}${lb}`;
    r += `${e.method} ${e.endpoint}${lb}`;
    r += e.requestBody ? `Content-Type: application/json; charset=utf-8${lb}${lb}` : lb;
    r += e.requestBody ? `${JSON.stringify(e.requestBody)}${lb}` : "";
    r += `--${boundary}${i == a.length - 1 ? "--" : ""}${lb}`;
    return r;
  }, `--${boundary}${lb}`);
  const params = {
    muteHttpExceptions: true,
    method: "post",
    contentType: `multipart/mixed; boundary=${boundary}`,
    headers: { Authorization: "Bearer " + ScriptApp.getOAuthToken() },
    payload,
  };
  return UrlFetchApp.fetch(`https://www.googleapis.com/${batchPath}`, params);
}

// Please run this function.
function main() {
  const folderId = "###"; // Please set folder ID you want to put the files.
  // Please set your document Ids.
  const documentIds = [
    "### Document ID1 ###",
    "### Document ID2 ###",
    "### Document ID3 ###",
    // Add more document IDs here. Do not leave empty elements in this array.
  ];

  // Run batch requests.
  const requests = documentIds.map((id) => ({
    method: "GET",
    endpoint: `https://www.googleapis.com/drive/v3/files/${id}/export?mimeType=application/zip`,
  }));
  const object = { batchPath: "batch/drive/v3", requests };
  const res = batchRequests_(object);
  if (res.getResponseCode() != 200) {
    throw new Error(res.getContentText());
  }

  // Parse data as binary data, and create the data as Blob.
  const check = res.getContentText().match(/--batch.*/);
  if (!check) {
    throw new Error("Valid response value is not returned.");
  }
  const search = check[0];
  const baseData = res.getContent();
  const searchData = Utilities.newBlob(search).getBytes();
  const res1 = splitByteArrayBySearchData_(baseData, searchData);
  res1.shift();
  res1.pop();
  const blobs = res1.map((e, i) => {
    const rrr = splitByteArrayBySearchData_(e, [13, 10, 13, 10]);
    const data = rrr.pop();
    const metadata = Utilities.newBlob(rrr.flat()).getDataAsString();
    const dataSize = Number(metadata.match(/Content-Length:(.*)/)[1]);
    return Utilities.newBlob(data.splice(0, dataSize)).setName(`sampleName${i + 1}.zip`);
  });

  // Create blobs as the files in Google Drive.
  const folder = DriveApp.getFolderById(folderId);
  blobs.forEach(b => {
    if (b) {
      console.log({ filename: b.getName(), fileSize: b.getBytes().length })
      folder.createFile(b);
    }
  });
}

When this script is run, zip files containing the HTML data converted from the Google Documents are created in the folder. The sample filenames are sampleName1.zip, sampleName2.zip, sampleName3.zip, and so on.
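
One caveat of the sample script above: each part is split at every 13, 10, 13, 10 sequence (CRLF CRLF) and the last piece is used as the zip data. If the zip data itself happens to contain those 4 bytes, the extracted data might be cut. As a hedged alternative, the following sketch finds only the first two CRLF CRLF sequences and then takes Content-Length bytes. The helper names indexOfBytes_ and extractBody_ are hypothetical, the part layout is assumed to match the sample response shown earlier, and I have not tested this against a real batch response.

```javascript
// Sketch only: indexOfBytes_ and extractBody_ are hypothetical helper names.

/** Return the index of the first occurrence of `pattern` in `bytes` at or after `from`, or -1. */
function indexOfBytes_(bytes, pattern, from = 0) {
  for (let i = from; i <= bytes.length - pattern.length; i++) {
    let j = 0;
    while (j < pattern.length && bytes[i + j] === pattern[j]) j++;
    if (j === pattern.length) return i;
  }
  return -1;
}

/**
 * Extract the body bytes of one batch part.
 * Assumed layout: outer headers, blank line, status line + inner headers, blank line, body.
 */
function extractBody_(part) {
  const crlf2 = [13, 10, 13, 10];
  const outerEnd = indexOfBytes_(part, crlf2);
  if (outerEnd === -1) return null;
  const innerEnd = indexOfBytes_(part, crlf2, outerEnd + 4);
  if (innerEnd === -1) return null;
  // Headers are ASCII, so converting them byte by byte to a string is safe.
  const headers = String.fromCharCode(...part.slice(outerEnd + 4, innerEnd));
  const m = headers.match(/Content-Length:\s*(\d+)/i);
  const start = innerEnd + 4;
  const end = m ? start + Number(m[1]) : part.length;
  return part.slice(start, end);
}

// Demo with a fake part whose body contains CRLF CRLF on purpose.
const toBytes = (s) => Array.from(s, (c) => c.charCodeAt(0));
const part = toBytes(
  "\r\nContent-Type: application/http\r\nContent-ID: response-1\r\n\r\n" +
  "HTTP/1.1 200 OK\r\nContent-Type: application/zip\r\nContent-Length: 8\r\n\r\n" +
  "PK\r\n\r\nAB\r\n"
);
const body = extractBody_(part);
console.log(body.length); // 8
```

In the sample script, each element of res1 could be passed to extractBody_ and the returned byte array given to Utilities.newBlob(), instead of using the second splitByteArrayBySearchData_ call.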

Note:

IMPORTANT: I'm not sure whether this method can be used for 100 requests in one batch, because an error might occur when the response size is more than 50 MB. So, when you test this script, please start with a small number of sample Google Documents.
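
If the response size becomes a problem, one possible workaround is to send several smaller batch requests instead of one large one. The following is a minimal sketch of splitting the document IDs into groups. The helper name chunk_ and the group size of 10 are assumptions for illustration, not tested limits.

```javascript
// Sketch: split a list of document IDs into smaller groups so that each batch
// request, and therefore each response, stays small.
// chunk_ is a hypothetical helper and 10 is an arbitrary group size.
function chunk_(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer.");
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Demo with 25 placeholder IDs.
const ids = Array.from({ length: 25 }, (_, i) => `id${i + 1}`);
const groups = chunk_(ids, 10);
console.log(groups.length);    // 3
console.log(groups[2].length); // 5
```

Each group could then be mapped to requests and passed to batchRequests_ in turn, in the same way as documentIds in the sample script.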
I just noticed your comment "I am at the moment only interested in the HTML part rather than images". As another approach, when mimeType=application/zip is changed to mimeType=text/html, it seems that only the HTML data is included in the response value, as a string. In this case, the response data can be parsed as a string.
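
For the text/html case, a minimal sketch of parsing the batch response as a string is as follows. The helper name parseTextBatch_ is hypothetical, and the part layout is assumed to match the sample response shown earlier in this answer. The sample string below is made up so that the sketch can be run without calling the API.

```javascript
// Sketch: parse a multipart batch response whose parts are text (e.g. text/html).
// parseTextBatch_ is a hypothetical helper name.
function parseTextBatch_(text) {
  const m = text.match(/--batch[^\r\n]*/);
  if (!m) throw new Error("Boundary not found.");
  const boundary = m[0];
  return text
    .split(boundary)
    .slice(1, -1) // Drop the text before the first boundary and after the last one.
    .map((part) => {
      // Assumed layout: outer headers, blank line, status line + headers, blank line, body.
      const first = part.indexOf("\r\n\r\n");
      const second = first === -1 ? -1 : part.indexOf("\r\n\r\n", first + 4);
      if (second === -1) return { status: NaN, body: "" };
      const inner = part.slice(first + 4, second);
      const status = Number((inner.match(/^HTTP\/[\d.]+ (\d+)/) || [])[1]);
      // Remove the line break that precedes the next boundary.
      const body = part.slice(second + 4).replace(/\r\n$/, "");
      return { status, body };
    });
}

// Demo with a made-up response containing two parts.
const sample = [
  "--batch_abc",
  "Content-Type: application/http",
  "Content-ID: response-1",
  "",
  "HTTP/1.1 200 OK",
  "Content-Type: text/html",
  "",
  "<html><body>doc1</body></html>",
  "--batch_abc",
  "Content-Type: application/http",
  "Content-ID: response-2",
  "",
  "HTTP/1.1 200 OK",
  "Content-Type: text/html",
  "",
  "<html><body>doc2</body></html>",
  "--batch_abc--",
  "",
].join("\r\n");

const results = parseTextBatch_(sample);
console.log(results.length);  // 2
console.log(results[0].body); // <html><body>doc1</body></html>
```

In Apps Script, the input would be res.getContentText() from a batch request whose endpoints use mimeType=text/html.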

Reference:

Efficient File Management using Batch Requests with Google Apps Script (Author: me)