
Google Apps Script: how to convert a PDF to a Google Doc in order to get OCR?


I'm trying to write a script that finds a PDF (received in Gmail) by a serial number I already have, saves it to Drive, runs OCR on it, and reads the content.

The first step is no problem, and the second is handled by the code below, but the last two lines, which open the document with DocumentApp in order to call getText(), are not working:

  var serial = "123456789";
  var ret = DriveApp.searchFiles('fullText contains "' + serial + '"');
  if (ret.hasNext()) {
    var file = ret.next();
    var n_blob = Utilities.newBlob(file.getBlob().getDataAsString(), MimeType.PDF);
    n_blob.setName(serial);
    var n_file = DriveApp.createFile(n_blob);
    var rt = DocumentApp.openById(n_file.getId()); // not working
    var text = rt.getBody().getText(); // not working
  }

I tried many different ways, including the solution based on Drive.Files.insert(), which no longer works.

I'm pretty stuck here. Does anyone have an idea or suggestion to help me out?

Thanks


Solution

    • You want to convert a PDF file to a Google Document file.
      • The file from var file = ret.next(); is always a PDF file.
    • You want to achieve this using Google Apps Script.

    If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.

    Modification points:

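    • Unfortunately, var n_blob = Utilities.newBlob(file.getBlob().getDataAsString(), MimeType.PDF) and var n_file = DriveApp.createFile(n_blob) do not create a Google Document. The new file is still a PDF, so DocumentApp.openById() throws an error. In addition, getDataAsString() can corrupt the binary PDF data.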

    Pattern 1:

    In this pattern, Drive.Files.copy is used to convert the PDF to a Google Document, because your question says that Drive.Files.insert() no longer works.

    Modified script:

    Please modify your script as follows. Before you run the script, please enable the Drive API in Advanced Google services.

    From:
    if (ret.hasNext()) {
      var file = ret.next();
      var n_blob = Utilities.newBlob(file.getBlob().getDataAsString(), MimeType.PDF);
      n_blob.setName(serial);
      var n_file = DriveApp.createFile(n_blob);
      var rt = DocumentApp.openById(n_file.getId()); // not working
      var text = rt.getBody().getText(); // not working
    }
    
    To:
    if (ret.hasNext()) {
      var file = ret.next();
      if (file.getMimeType() === MimeType.PDF) {
        var fileId = Drive.Files.copy({mimeType: MimeType.GOOGLE_DOCS}, file.getId()).id;
        var rt = DocumentApp.openById(fileId);
        var text = rt.getBody().getText();
        Logger.log(text)
      }
    }
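
    Pattern 1 leaves the converted Google Document in Drive after the text has been read. As a minimal sketch, assuming you do not need to keep that copy, the conversion can be wrapped in a helper that deletes it afterwards. The function name readPdfText and its drive and documentApp parameters are hypothetical; the parameters default to the Apps Script services and exist only so the function can be tried with stubs. Note that Drive.Files.remove deletes the file permanently rather than moving it to the trash.

```javascript
/**
 * Converts a PDF to a temporary Google Document with Drive.Files.copy,
 * returns the document text, then deletes the temporary document.
 * `drive` and `documentApp` are hypothetical parameters that default to the
 * Apps Script services (the advanced Drive service must be enabled).
 */
function readPdfText(pdfFileId, drive, documentApp) {
  drive = drive || Drive;
  documentApp = documentApp || DocumentApp;

  // Same value as MimeType.GOOGLE_DOCS; asking for this type on copy
  // makes Drive convert the PDF.
  var resource = { mimeType: 'application/vnd.google-apps.document' };
  var docId = drive.Files.copy(resource, pdfFileId).id;

  try {
    return documentApp.openById(docId).getBody().getText();
  } finally {
    // Permanently deletes the temporary copy, even if reading failed.
    drive.Files.remove(docId);
  }
}
```

    In the modified script above, this would be called as var text = readPdfText(file.getId());.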
    

    Pattern 2:

    I thought that Drive.Files.insert might still be usable, so this pattern proposes a modified script using Drive.Files.insert. Could you please test it?

    Modified script:

    Please modify your script as follows. Before you run the script, please enable the Drive API in Advanced Google services.

    From:
    if (ret.hasNext()) {
      var file = ret.next();
      var n_blob = Utilities.newBlob(file.getBlob().getDataAsString(), MimeType.PDF);
      n_blob.setName(serial);
      var n_file = DriveApp.createFile(n_blob);
      var rt = DocumentApp.openById(n_file.getId()); // not working
      var text = rt.getBody().getText(); // not working
    }
    
    To:
    if (ret.hasNext()) {
      var file = ret.next();
      if (file.getMimeType() === MimeType.PDF) {
        var fileId = Drive.Files.insert({title: serial, mimeType: MimeType.GOOGLE_DOCS}, file.getBlob()).id;
        var rt = DocumentApp.openById(fileId);
        var text = rt.getBody().getText();
        Logger.log(text)
      }
    }
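
    Drive.Files.insert belongs to version 2 of the advanced Drive service. When the service is set to version 3, the corresponding method is Drive.Files.create, and the file name goes in name instead of title. The following is only a sketch under that assumption: the helper name pdfBlobToDocId and its drive parameter are hypothetical, and detecting the version by checking whether Files.create exists is my assumption about how the service object behaves, so please test it in your project.

```javascript
/**
 * Uploads a PDF blob as a Google Document and returns the new file ID.
 * Uses Drive.Files.create when it is available (Drive API v3) and falls
 * back to Drive.Files.insert (Drive API v2) otherwise.
 * `drive` is a hypothetical parameter that defaults to the advanced Drive service.
 */
function pdfBlobToDocId(blob, title, drive) {
  drive = drive || Drive;

  // Same value as MimeType.GOOGLE_DOCS; requesting it triggers the conversion.
  var resource = { mimeType: 'application/vnd.google-apps.document' };

  if (typeof drive.Files.create === 'function') {
    resource.name = title;   // v3 uses "name"
    return drive.Files.create(resource, blob).id;
  }
  resource.title = title;    // v2 uses "title"
  return drive.Files.insert(resource, blob).id;
}
```

    In the modified script above, this would replace the Drive.Files.insert line with var fileId = pdfBlobToDocId(file.getBlob(), serial);.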
    

    Note:

    • Unfortunately, I could not determine why Drive.Files.insert() no longer works for you. If the modified scripts above do not work, please tell me and I will think of other methods.
    • If the log does not show the text of the Google Document converted from the PDF, it means that the file returned by var file = ret.next(); is not a PDF. Please be careful about this.
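
    To reduce the chance that the first search result is not a PDF, the search query itself can be limited to PDF files, and every match can be collected instead of only the first one. This is a minimal sketch: buildPdfQuery and findPdfs are hypothetical helper names, the driveApp parameter exists only so the function can be tried with a stub, and the serial is assumed to contain no quote characters.

```javascript
/**
 * Builds a Drive search query that matches only PDF files whose
 * indexed text contains the serial number.
 */
function buildPdfQuery(serial) {
  return 'fullText contains "' + serial + '" and mimeType = "application/pdf"';
}

/**
 * Returns all matching PDF files as an array.
 * `driveApp` is a hypothetical parameter that defaults to DriveApp.
 */
function findPdfs(serial, driveApp) {
  driveApp = driveApp || DriveApp;
  var files = driveApp.searchFiles(buildPdfQuery(serial));
  var pdfs = [];
  while (files.hasNext()) {
    pdfs.push(files.next());
  }
  return pdfs;
}
```

    Each file in the returned array can then be converted with either pattern above.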

    If I misunderstood your question and this was not the direction you wanted, I apologize.