javascript google-apps-script gmail-addons pdf.js

How to get the Producer of a PDF in Google Apps Script


I am trying to write a Gmail add-on that iterates over all emails and creates a report based on the Producer of their PDFs. Iterating over the emails is the easy part and is already done, but I can't find a way to get the Producer line of each PDF. So far I have tried:

  • analyzing the blob directly, but that amounts to writing a PDF library to parse the full syntax, and the Producer tag is not clearly present
  • adding pdf.js, a third-party open-source tool that can extract such information, but I couldn't add it because of an ES3/ES6 support issue

What's the best way to get the Producer line of a PDF in Google Apps Script?

Thank you


Solution

    • You want to retrieve the value of Producer from a PDF file.

    If my understanding above is correct, how about this sample script? It retrieves the value of Producer from the content of your shared PDF files using 2 regular expressions. Please think of this as just one of several possible answers.

    Sample script:

    Before you use this script, please set the ID of the folder that contains the PDF files. The script retrieves the value from all PDF files in that folder.

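    // Wrapped in a function so that it can be selected and run in the script editor.
    function logPdfProducers() {
      var folderId = "### folderId ###"; // ID of the folder that contains the PDF files
      var files = DriveApp.getFolderById(folderId).getFilesByType(MimeType.PDF);
      // Two places where Producer can appear in the raw file content:
      // a "Producer(...)" entry and a <pdf:Producer> XML element.
      var regex = [/Producer\((\w.+)\)/i, /<pdf:Producer>(\w.+)<\/pdf:Producer>/i];
      var result = [];
      while (files.hasNext()) {
        var file = files.next();
        // Read the PDF as text so that the regular expressions can be applied.
        var content = file.getBlob().getDataAsString();
        // Try each pattern in turn; if both match, the last match is kept.
        var r = regex.reduce(function(s, e) {
          var m = content.match(e);
          if (Array.isArray(m)) s = m[1];
          return s;
        }, "");
        result.push({
          fileName: file.getName(),
          fileId: file.getId(),
          vaueOfProducer: r, // empty string when neither pattern matches
        });
      }
      Logger.log(result); // Result
    }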
    

    Result:

    This sample result was retrieved from a folder in my Google Drive that contains the 3 shared PDF files.

    [
      {
        "fileName": "2348706469653861032.pdf",
        "fileId": "###",
        "vaueOfProducer": "iText� 7.1.5 �2000-2019 iText Group NV \(iText; licensed version\)"
      },
      {
        "fileName": "Getting started with OneDrive.pdf",
        "fileId": "###",
        "vaueOfProducer": "Adobe PDF library 15.00"
      },
      {
        "fileName": "DITO-Salesflow-040419-1359-46.pdf",
        "fileId": "###",
        "vaueOfProducer": "iText 2.1.7 by 1T3XT"
      }
    ]
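
The first entry above still contains the PDF escape sequences `\(` and `\)` as well as characters that do not display. If a cleaner value is wanted for a report, a post-processing step along these lines might help. This is only a sketch: `cleanProducer` is a hypothetical helper name, not part of the sample script, and dropping everything outside printable ASCII is a lossy assumption that also removes legitimate non-ASCII characters.

```javascript
// Hypothetical helper (not part of the original sample script):
// tidies a raw Producer value extracted from the PDF content.
function cleanProducer(raw) {
  return raw
    .replace(/\\([()\\])/g, "$1")   // unescape \( \) and \\ used in the raw value
    .replace(/[^\x20-\x7E]/g, "")   // assumption: keep printable ASCII only (lossy)
    .replace(/\s{2,}/g, " ")        // collapse runs of whitespace left behind
    .trim();
}
```

It could be applied to `r` before the value is pushed into `result`.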
    

    Note:

    • For 2348706469653861032.pdf, the value of Producer includes characters that cannot be displayed.
    • This is a sample script, so please modify it for your situation.
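
As one possible modification for the Gmail add-on case in the question, the matching step can be separated from DriveApp so that it works on any text content. The sketch below reuses the same 2 regular expressions as the sample script; `PRODUCER_PATTERNS` and `extractProducer` are hypothetical names introduced here for illustration.

```javascript
// Same two patterns as in the sample script above.
var PRODUCER_PATTERNS = [
  /Producer\((\w.+)\)/i,
  /<pdf:Producer>(\w.+)<\/pdf:Producer>/i
];

// Hypothetical helper: returns the Producer value found in the given
// PDF content (as a string), or "" when neither pattern matches.
function extractProducer(content) {
  var value = "";
  for (var i = 0; i < PRODUCER_PATTERNS.length; i++) {
    var m = content.match(PRODUCER_PATTERNS[i]);
    if (m) value = m[1]; // a later pattern overrides an earlier one, as in the reduce() version
  }
  return value;
}
```

In an add-on, the content of each attachment returned by `message.getAttachments()` could be read with `getDataAsString()` and passed to this function. That wiring is an assumption about how the add-on is structured and was not tested against the shared files.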