groovy, apache-nifi

NiFi extract from PDF to text


I used three processors to do this:

  1. GetFile
  2. ExecuteScript
  3. PutFile

In the ExecuteScript processor I used a Groovy script, following the steps in the link below. It works, but the last few pages, or the last few lines of the last page, are not extracted. I tried it with different PDF files and ran into the same issue.

import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*

def flowFile = session.get()
if(!flowFile) return

def doc, info
def s  = new PDFTextStripper()

flowFile = session.write(flowFile, { inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()

    s.writeText(doc, new OutputStreamWriter(outputStream))
} as StreamCallback)
flowFile = session.putAttribute(flowFile, 'pdf.page.count', "${doc.getNumberOfPages()}")
flowFile = session.putAttribute(flowFile, 'pdf.title', "${info.getTitle()}")
flowFile = session.putAttribute(flowFile, 'pdf.author', "${info.getAuthor()}")
flowFile = session.putAttribute(flowFile, 'pdf.subject', "${info.getSubject()}")
flowFile = session.putAttribute(flowFile, 'pdf.keywords', "${info.getKeywords()}")
flowFile = session.putAttribute(flowFile, 'pdf.creator', "${info.getCreator()}")
flowFile = session.putAttribute(flowFile, 'pdf.producer', "${info.getProducer()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.creation', "${info.getCreationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.modified', "${info.getModificationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.trapped', "${info.getTrapped()}")
session.transfer(flowFile, REL_SUCCESS)

http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html

Is there a way to fix this?


Solution

  • The problem seems to be in this line of code:

    s.writeText(doc, new OutputStreamWriter(outputStream))
    

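    You are creating an OutputStreamWriter, which has an internal buffer. That buffer is only transferred to the underlying output stream when OutputStreamWriter.flush() or OutputStreamWriter.close() is called. Neither method is called in your code, so whatever is still in the buffer when the callback returns never reaches the flow file, which is why the end of the text is missing.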

    You can use the Groovy method withWriter, which closes the writer (and so flushes it) after the closure finishes:

    outputStream.withWriter{w-> s.writeText(doc, w) }
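
The buffering behaviour described above can be reproduced outside NiFi. The following is a minimal sketch, not part of the original script: a ByteArrayOutputStream stands in for the NiFi output stream, the sample text and variable names are made up for illustration, and the explicit 'UTF-8' charset is an assumption (the original code relies on the platform default).

```groovy
// Stand-in for the NiFi output stream: it records whatever actually reaches it.
def text = 'last lines of the last page'   // hypothetical sample text

// Case 1: a writer that is never flushed or closed, as in the question.
def unflushed = new ByteArrayOutputStream()
def writer = new OutputStreamWriter(unflushed, 'UTF-8')
writer.write(text)
// The text is still held in the writer's internal buffer at this point.
def bytesBeforeFlush = unflushed.size()

// Flushing pushes the buffered bytes to the underlying stream.
writer.flush()
def bytesAfterFlush = unflushed.size()

// Case 2: withWriter wraps the stream in a writer, runs the closure,
// and closes the writer afterwards, which flushes the buffer.
def closed = new ByteArrayOutputStream()
closed.withWriter('UTF-8') { w -> w.write(text) }
def bytesWithWriter = closed.size()

println "before flush: ${bytesBeforeFlush}, after flush: ${bytesAfterFlush}, withWriter: ${bytesWithWriter}"
```

With a short string like this, nothing reaches the stream until flush() is called. With a long PDF text the buffer is emptied every time it fills up, so only the tail is lost, which matches the symptom in the question. Inside the ExecuteScript callback the same pattern would read outputStream.withWriter('UTF-8') { w -> s.writeText(doc, w) }, again assuming UTF-8 is the encoding you want.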