I'm trying to extract text from PDF documents. I've tested several tools, including PDFBox, TET, and PDFTextStream, but none of them does a good job of extracting text from multi-column Persian PDF documents.
Currently I'm trying to combine the good features of these tools and apply some tricks on top of them. I'd like to know how I can detect the number of columns on a page and how to split the text of those columns.
In particular, I want to know which class of PDFBox or PDFTextStream is responsible for column detection and how it works.
I can only speak for PDFTextStream, but in order to understand how it works, you want to understand, roughly, how PDFTextStream looks at a PDF document.
Each document is made up of Pages, which are made up of Blocks (of which there can be many, and they can be nested). Blocks ultimately contain Lines, which contain TextUnits.
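To make the hierarchy concrete, here is a minimal sketch of such a layout tree in Java. The LayoutNode class and its methods are hypothetical names made up for illustration; they are not PDFTextStream's (or PDFBox's) real classes, just one way the Page / Block / Line / TextUnit nesting could be modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node in the layout tree: a page, block, line, or text unit. */
final class LayoutNode {
    final String kind;                 // "page", "block", "line", or "unit"
    final String text;                 // non-null only for "unit" nodes
    final double x, y, width, height;  // bounding box in page coordinates
    final List<LayoutNode> children = new ArrayList<>();

    private LayoutNode(String kind, String text,
                       double x, double y, double width, double height) {
        this.kind = kind;
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** A leaf: one run of text with its own position and size. */
    static LayoutNode unit(String text, double x, double y, double width, double height) {
        return new LayoutNode("unit", text, x, y, width, height);
    }

    /** A container whose box is the smallest rectangle enclosing its children. */
    static LayoutNode container(String kind, List<LayoutNode> children) {
        if (children.isEmpty()) {
            throw new IllegalArgumentException("container needs at least one child");
        }
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (LayoutNode c : children) {
            minX = Math.min(minX, c.x);
            minY = Math.min(minY, c.y);
            maxX = Math.max(maxX, c.x + c.width);
            maxY = Math.max(maxY, c.y + c.height);
        }
        LayoutNode node = new LayoutNode(kind, null, minX, minY, maxX - minX, maxY - minY);
        node.children.addAll(children);
        return node;
    }

    /** Collects the text units under this node in stored order, one output line per "line" node. */
    String plainText() {
        StringBuilder out = new StringBuilder();
        appendText(out);
        return out.toString();
    }

    private void appendText(StringBuilder out) {
        if (text != null) {
            out.append(text);
            return;
        }
        for (int i = 0; i < children.size(); i++) {
            children.get(i).appendText(out);
            // Units inside a line are separated by a single space.
            if (kind.equals("line") && i < children.size() - 1) {
                out.append(' ');
            }
        }
        if (kind.equals("line")) {
            out.append('\n');
        }
    }
}
```

Note that plainText() simply trusts the stored order of the children. A real extractor has to work that order out from the coordinates, which is where multi-column pages become difficult.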
Each of these units has an x, y, width and height property. A PDF is nothing more than these basic units laid out according to their coordinates. When you ask PDFTextStream to "read" a page or a region, it looks at the objects and how they are laid out on the X, Y plane and uses an approximation of how that would translate to text. This is why you get errors: there is no 100% foolproof way to turn this structure into machine-readable, structured data.
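One common way to approximate columns from coordinates alone is to look for vertical strips of whitespace: take the horizontal extent of every line, merge the extents that overlap or nearly touch, and treat each merged band as a column. The sketch below illustrates that general heuristic. It is not the algorithm PDFTextStream or PDFBox actually uses, and the ColumnGuesser class, its methods, and the minGap parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: guesses column bands from the horizontal extents of text lines. */
final class ColumnGuesser {

    /**
     * @param spans  one {left, right} pair per text line, in page coordinates
     * @param minGap smallest horizontal whitespace that counts as a column gutter
     * @return merged {left, right} bands, ordered left to right
     */
    static List<double[]> guessColumns(List<double[]> spans, double minGap) {
        List<double[]> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> Double.compare(a[0], b[0]));

        List<double[]> columns = new ArrayList<>();
        for (double[] s : sorted) {
            double[] last = columns.isEmpty() ? null : columns.get(columns.size() - 1);
            if (last != null && s[0] - last[1] < minGap) {
                // Overlaps or nearly touches the current band: same column.
                last[1] = Math.max(last[1], s[1]);
            } else {
                // Wide enough gap (or first span): start a new column band.
                columns.add(new double[] {s[0], s[1]});
            }
        }
        return columns;
    }

    /** Reading order for right-to-left scripts such as Persian: rightmost column first. */
    static List<double[]> inRightToLeftOrder(List<double[]> columns) {
        List<double[]> reversed = new ArrayList<>(columns);
        Collections.reverse(reversed);
        return reversed;
    }
}
```

The number of columns is then the size of the returned list. This also shows why the approach is only an approximation: a single full-width line, such as a title or a table, bridges the gutter and merges every column into one band, so a practical version would need to filter such lines out or work on one vertical slice of the page at a time.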
In PDFTextStream, you should look at the getRegionText function and example. PDFTextStream is proprietary (which is why I'm moving to PDFBox), so I can't give you details about the algorithms used to fetch the text, but they're based on the above oversimplification.
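Conceptually, reading a region just means keeping the units whose boxes fall inside a rectangle and putting them in reading order. The sketch below shows that idea only. It does not reproduce getRegionText or its signature, and PlacedLine, RegionReader and textInRegion are hypothetical names. It also assumes that y is the top edge of a line and grows downward, which is not the native PDF coordinate system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical positioned line of text; y is the top edge and grows downward. */
final class PlacedLine {
    final String text;
    final double x, y, width, height;

    PlacedLine(String text, double x, double y, double width, double height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

final class RegionReader {

    /**
     * Returns the text of every line whose centre lies inside the rectangle
     * (left, top, regionWidth, regionHeight), ordered from top to bottom.
     */
    static String textInRegion(List<PlacedLine> lines, double left, double top,
                               double regionWidth, double regionHeight) {
        List<PlacedLine> hits = new ArrayList<>();
        for (PlacedLine line : lines) {
            double cx = line.x + line.width / 2;
            double cy = line.y + line.height / 2;
            boolean inside = cx >= left && cx <= left + regionWidth
                    && cy >= top && cy <= top + regionHeight;
            if (inside) {
                hits.add(line);
            }
        }
        // Under the stated assumption, a smaller y means higher on the page.
        hits.sort((a, b) -> Double.compare(a.y, b.y));

        StringBuilder out = new StringBuilder();
        for (PlacedLine line : hits) {
            out.append(line.text).append('\n');
        }
        return out.toString();
    }
}
```

For a two-column Persian page you would call this once per column rectangle, right column first, and concatenate the results. On the PDFBox side, the PDFTextStripperByArea class provides region-based extraction (addRegion, extractRegions, getTextForRegion), so rectangles derived from detected columns can be passed to it in a similar way.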
Good luck.