NiFi ExecuteScript (Groovy): Using PDFBox to extract text/images from PDF: error loading modules

NiFi 1.11.4

Hi there,

I found an interesting solution for extracting text and images from PDF files with ExecuteScript (Groovy).

The Groovy script starts with:

import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*
def flowFile = session.get()
if(!flowFile) return
def s = new PDFTextStripper()

With PDFBox 1.8.16 the script runs without errors, but the text returned by PDFTextStripper is always empty (and yes, the PDF files contain text, not images).

With PDFBox 2.0.19 the script does not run:

Module Directory contents for PDFBox 2.0:

29.04.2020  12:56         2.715.618 pdfbox-2.0.19.jar
29.04.2020  19:36           257.911 pdfbox-debugger-2.0.19.jar
29.04.2020  19:36            81.206 pdfbox-tools-2.0.19.jar
29.04.2020  19:36           247.912 preflight-2.0.19.jar
29.04.2020  19:36           132.182 xmpbox-2.0.19.jar
29.04.2020  19:36         1.561.265 fontbox-2.0.19.jar

Error:

Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: 
startup failed:
Script9.groovy: 18: unable to resolve class PDFTextStripper
 @ line 18, column 9.
   def s = new PDFTextStripper()

Any idea what is missing?

Thanx Frank

Solution

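PDFTextStripper was moved to a new package. In PDFBox 1.8.x it was indeed in org.apache.pdfbox.util, but since 2.0.0 it has been in org.apache.pdfbox.text.

Thus, you need to adjust your import statements for use with PDFBox 2.x. The wildcard import of org.apache.pdfbox.util.* no longer brings PDFTextStripper into scope, which is why the compiler reports "unable to resolve class".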
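For illustration, here is a minimal sketch of how the script could look with the PDFBox 2.x import. Only the changed import comes from the answer above; the rest is an assumption about what the script is meant to do. It assumes the PDF arrives as the flow file content, that the extracted text should replace that content, and that the PDFBox 2.0.x jars listed in the question (plus any further libraries PDFBox needs that are not already on NiFi's classpath) are in the Module Directory. It relies on the `session`, `log`, `REL_SUCCESS` and `REL_FAILURE` variables that ExecuteScript binds for the script.

```groovy
import org.apache.nifi.processor.io.InputStreamCallback
import org.apache.nifi.processor.io.OutputStreamCallback
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 2.x location; in 1.8.x this class lived in org.apache.pdfbox.util
import org.apache.pdfbox.text.PDFTextStripper

def flowFile = session.get()
if (!flowFile) return

try {
    def text = ''

    // Read the PDF from the incoming flow file content (assumed input)
    session.read(flowFile, { inputStream ->
        PDDocument document = PDDocument.load(inputStream)
        try {
            // getText() walks the document and returns its text as one String
            text = new PDFTextStripper().getText(document)
        } finally {
            document.close()
        }
    } as InputStreamCallback)

    // Replace the flow file content with the extracted text (assumed output)
    flowFile = session.write(flowFile, { outputStream ->
        outputStream.write(text.getBytes('UTF-8'))
    } as OutputStreamCallback)

    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('PDF text extraction failed', e)
    session.transfer(flowFile, REL_FAILURE)
}
```

Importing the two PDFBox classes explicitly instead of with wildcards makes a moved class show up directly on the import line, which is easier to diagnose than a failure at the first use of the class.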