I used 3 processors to do this.
In the ExecuteScript processor I used a Groovy script, following the steps in the link below. It works, but the last few pages, or the last few lines of the last page, do not get extracted. I tried it with different PDF files and ran into the same issue.
import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*
def flowFile = session.get()
if(!flowFile) return
def doc, info
def s = new PDFTextStripper()
flowFile = session.write(flowFile, {inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    s.writeText(doc, new OutputStreamWriter(outputStream))
} as StreamCallback
)
flowFile = session.putAttribute(flowFile, 'pdf.page.count', "${doc.getNumberOfPages()}")
flowFile = session.putAttribute(flowFile, 'pdf.title', "${info.getTitle()}" )
flowFile = session.putAttribute(flowFile, 'pdf.author',"${info.getAuthor()}" );
flowFile = session.putAttribute(flowFile, 'pdf.subject', "${info.getSubject()}" );
flowFile = session.putAttribute(flowFile, 'pdf.keywords', "${info.getKeywords()}" );
flowFile = session.putAttribute(flowFile, 'pdf.creator', "${info.getCreator()}" );
flowFile = session.putAttribute(flowFile, 'pdf.producer', "${info.getProducer()}" );
flowFile = session.putAttribute(flowFile, 'pdf.date.creation', "${info.getCreationDate()}" );
flowFile = session.putAttribute(flowFile, 'pdf.date.modified', "${info.getModificationDate()}");
flowFile = session.putAttribute(flowFile, 'pdf.trapped', "${info.getTrapped()}" );
session.transfer(flowFile, REL_SUCCESS)
http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html
Is there a way to fix this?
The problem seems to be in this line of code:
s.writeText(doc, new OutputStreamWriter(outputStream))
You are creating an OutputStreamWriter, which internally has a buffer that is only transferred to the underlying output stream on a call to OutputStreamWriter.flush() or OutputStreamWriter.close(). Neither of those methods is called in your code, so whatever is still in the buffer when writeText returns never reaches the flow file.
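The buffering can be seen outside NiFi with a small standalone sketch. It assumes a ByteArrayOutputStream as a stand-in for the flow file's output stream; the variable names and the UTF-8 charset are illustrative choices, not taken from the original script. On a typical JDK a short string stays in the writer's buffer until flush() is called. A full buffer is written through on its own, which would explain why only the tail of the extracted text goes missing.

```groovy
// Stand-in for the flow file's output stream (hypothetical, not NiFi code).
def sink = new ByteArrayOutputStream()
def writer = new OutputStreamWriter(sink, 'UTF-8')

writer.write('last line of the last page')
// At this point the text is normally still held in the writer's internal buffer.
def beforeFlush = sink.size()

// flush() pushes the buffered bytes into the underlying stream.
writer.flush()
def afterFlush = sink.size()

println "bytes in stream before flush: ${beforeFlush}, after flush: ${afterFlush}"
```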
You can use the Groovy method withWriter to close the writer, and with it flush the buffer, after the closure has finished:
outputStream.withWriter{w-> s.writeText(doc, w) }
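As a minimal sketch of what withWriter does, the snippet below again uses a ByteArrayOutputStream in place of NiFi's output stream (an assumption for illustration). It also passes an explicit charset, which the original line does not; without it the writer uses the platform default encoding.

```groovy
// Stand-in for the flow file's output stream (hypothetical, not NiFi code).
def sink = new ByteArrayOutputStream()

// withWriter wraps the stream in a writer, runs the closure,
// then flushes and closes the writer even if the closure throws.
sink.withWriter('UTF-8') { w ->
    w.write('last line of the last page')
}

// Everything written inside the closure has reached the stream.
def text = sink.toString('UTF-8')
println text
```

Note that withWriter also closes the underlying stream, so nothing else should be written to it afterwards.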