I want to extract plain text from a PDF and run it through a named entity recognition function that spits out text and string positions.
I'm thinking of using pdfminer to extract the text from my PDF, and I wonder whether it's possible to map string positions back to page coordinates. For example, if my extracted text were 'Hello World', how do I get the page coordinates of 'World' given its string positions [6:11]?
Thank you!
Mapping string positions back to PDF word positions is not easy, so let's use a basic example:
%PDF-1.4
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 6 0 R>>
endobj
4 0 obj<</Font <</F1 5 0 R>>>>
endobj
5 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
6 0 obj
<</Length 44>>
stream
BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000056 00000 n
0000000111 00000 n
0000000212 00000 n
0000000250 00000 n
0000000317 00000 n
trailer <</Size 7/Root 1 0 R>>
startxref
406
%%EOF
Here it is clear that the Helvetica string starts at x=175, y=720 (i.e. near the top of the page), but the page is 800 units high rather than the more usual 842 pt of A4. So the first problem is: what do you mean by coordinates, and what projection/transformation is at play?
We can usually say that the x value for World! will be positive, measured from the left edge, but the origin could be at the top right of the media, in which case World! would have negative x and y values.
For everyday PDF pages we work with the default origin for text at the bottom-left corner of the crop box or media box, unless stated otherwise. Images, by contrast, normally have their origin calculated from the top-left corner, which may lie outside their crop boundary. Libraries will help by giving you relative values which, with luck, are simplified to one common origin and scale, with little conflict due to transformations.
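Since the question mentions pdfminer, here is a minimal sketch of how a library can hand back per-character boxes, assuming pdfminer.six and its layout objects (`LTChar` carries a `bbox` in bottom-left PDF units; `LTAnno` is a synthesised space or newline with no box). The helper names `chars_with_boxes` and `span_bbox` are made up for illustration, and the sketch assumes the span of interest sits on a single page.

```python
def chars_with_boxes(pdf_path):
    """Yield (character, bbox) pairs in pdfminer's reading order.

    bbox is (x0, y0, x1, y1) in PDF units with a bottom-left origin, or None
    for characters pdfminer synthesises (LTAnno: inferred spaces, newlines).
    """
    # Imported lazily so span_bbox below can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        # A glyph may decode to several characters (ligatures);
                        # repeat its box so string indices stay aligned.
                        for ch in obj.get_text():
                            yield ch, obj.bbox
                    elif isinstance(obj, LTAnno):
                        for ch in obj.get_text():
                            yield ch, None


def span_bbox(pairs, start, end):
    """Union of the boxes of pairs[start:end], skipping entries without a box."""
    boxes = [b for _, b in pairs[start:end] if b is not None]
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


# Hypothetical usage (needs pdfminer.six and a real file):
#   pairs = list(chars_with_boxes("hello.pdf"))
#   text = "".join(ch for ch, _ in pairs)   # run the NER on this exact string
#   print(span_bbox(pairs, 6, 11))          # box around text[6:11]
```

The important design choice is to run the entity recognition on the string built from the same pairs, so that index i in the text always refers to entry i in the list; text produced by a different extraction call may not line up character for character.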
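In this case we could expect, for the bottom left of World!, x = 236 and y = 80 measured down from the top of the page (800 - 720).
However, a simple HTML conversion may use top:94px;left:263px for the two words together. The 500 x 800 page is rendered there at 750 x 1200 px, i.e. 1.5 px per unit, so left:263px is 175 x 1.5 rounded up, and top:94px presumably marks the top of the text box rather than the baseline (which would be at 80 x 1.5 = 120px).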
<body bgcolor="#A0A0A0" vlink="blue" link="blue">
<div id="page1-div" style="position:relative;width:750px;height:1200px;">
<img width="750" height="1200" src="hello001.png" alt="background image"/>
<p style="position:absolute;top:94px;left:263px;white-space:nowrap" class="ft00">Hello World!</p>
</div>
</body>
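The two conventions seen so far (PDF units from the bottom left, HTML pixels from the top left) differ only by a flip and a scale. A minimal sketch of that conversion, assuming the MediaBox starts at 0 0 and the page has no rotation or extra transformation; the function name `pdf_to_top_left` is made up for illustration:

```python
def pdf_to_top_left(x, y, page_height, scale=1.0):
    """Convert a point from PDF user space (origin bottom-left, y up)
    to a top-left origin with y pointing down, optionally scaled."""
    return (x * scale, (page_height - y) * scale)


# Baseline start of "World!" on the example page (MediaBox height 800).
print(pdf_to_top_left(236.344, 720, 800))

# Start of the whole string in the 750 x 1200 px HTML rendering
# (750 px / 500 units = 1.5 px per unit).
print(pdf_to_top_left(175, 720, 800, scale=750 / 500))
```

Note that this converts the baseline point; an HTML `top` value usually refers to the top of the text box, so it will be smaller than the converted baseline y.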
If you need a precise position for text, then a printer-style trace of the page components gives the most accurate answer:
mutool trace hello.pdf
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 800">
<span font="Helvetica" wmode="0" bidi="0" trm="24 0 0 24">
<g unicode="H" glyph="H" x="175" y="720" adv=".722"/>
<g unicode="e" glyph="e" x="192.328" y="720" adv=".556"/>
<g unicode="l" glyph="l" x="205.672" y="720" adv=".222"/>
<g unicode="l" glyph="l" x="211" y="720" adv=".222"/>
<g unicode="o" glyph="o" x="216.328" y="720" adv=".556"/>
<g unicode=" " glyph="space" x="229.672" y="720" adv=".278"/>
<g unicode="W" glyph="W" x="236.344" y="720" adv=".944"/>
<g unicode="o" glyph="o" x="259" y="720" adv=".556"/>
<g unicode="r" glyph="r" x="272.344" y="720" adv=".333"/>
<g unicode="l" glyph="l" x="280.336" y="720" adv=".222"/>
<g unicode="d" glyph="d" x="285.664" y="720" adv=".556"/>
<g unicode="!" glyph="exclam" x="299.008" y="720" adv=".278"/>
</span>
</fill_text>
So W of World starts at x="236.344" y="720"
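To tie this back to the question, the trace can be turned into a per-character table so that a string slice maps straight onto glyph positions. This is only a sketch using Python's standard XML parser on the fragment shown above; it assumes the trace is well-formed XML, that the first value of `trm` is a uniform, unrotated font scale, and that each `g` element carries exactly one character. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the trace fragment shown above.
TRACE = """<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 800">
<span font="Helvetica" wmode="0" bidi="0" trm="24 0 0 24">
<g unicode="H" glyph="H" x="175" y="720" adv=".722"/>
<g unicode="e" glyph="e" x="192.328" y="720" adv=".556"/>
<g unicode="l" glyph="l" x="205.672" y="720" adv=".222"/>
<g unicode="l" glyph="l" x="211" y="720" adv=".222"/>
<g unicode="o" glyph="o" x="216.328" y="720" adv=".556"/>
<g unicode=" " glyph="space" x="229.672" y="720" adv=".278"/>
<g unicode="W" glyph="W" x="236.344" y="720" adv=".944"/>
<g unicode="o" glyph="o" x="259" y="720" adv=".556"/>
<g unicode="r" glyph="r" x="272.344" y="720" adv=".333"/>
<g unicode="l" glyph="l" x="280.336" y="720" adv=".222"/>
<g unicode="d" glyph="d" x="285.664" y="720" adv=".556"/>
<g unicode="!" glyph="exclam" x="299.008" y="720" adv=".278"/>
</span>
</fill_text>"""


def glyphs_from_trace(xml_text):
    """Return (text, glyphs); glyphs[i] describes the i-th character of text."""
    root = ET.fromstring(xml_text)
    glyphs = []
    for span in root.iter("span"):
        size = float(span.get("trm").split()[0])  # assumed uniform font scale
        for g in span.iter("g"):
            glyphs.append({
                "char": g.get("unicode"),
                "x": float(g.get("x")),
                "y": float(g.get("y")),
                "width": float(g.get("adv")) * size,
            })
    return "".join(g["char"] for g in glyphs), glyphs


def slice_to_extent(glyphs, start, end):
    """Return (x_start, baseline_y, x_end) for text[start:end] on one line."""
    selected = glyphs[start:end]
    last = selected[-1]
    return selected[0]["x"], selected[0]["y"], last["x"] + last["width"]


text, glyphs = glyphs_from_trace(TRACE)
start = text.index("World")
print(text[start:start + 5], slice_to_extent(glyphs, start, start + 5))
```

The x and y values are still in the page's own units with a bottom-left origin; the `transform` attribute on `fill_text` is what flips them to a top-left origin for rendering.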
We can also calculate the width of the word World, either by adding each advance:
Total advance for World = .944 + .556 + .333 + .222 + .556
= 2.611 units, which scaled by the font size of 24 = 62.664 wide = 22.10647 mm
or by simpler subtraction: since we know ! starts at 299.008, subtracting 236.344 also gives 62.664.
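The arithmetic is easy to check in a few lines. This sketch assumes the page uses default PDF units of 1/72 inch, which is what makes the millimetre figure valid; the function name `word_width` is made up for illustration.

```python
PT_TO_MM = 25.4 / 72  # one default PDF unit is 1/72 inch


def word_width(advances, font_size):
    """Width of a run of glyphs: sum of advances scaled by the font size."""
    return sum(advances) * font_size


# Advances for W, o, r, l, d taken from the trace.
world = [0.944, 0.556, 0.333, 0.222, 0.556]

width_by_advances = word_width(world, 24)
width_by_subtraction = 299.008 - 236.344  # x of "!" minus x of "W"

print(round(width_by_advances, 3), round(width_by_subtraction, 3))
print(round(width_by_advances * PT_TO_MM, 5), "mm")
```

The subtraction shortcut only works when a following glyph exists on the same line and no extra character or word spacing has been applied between them; summing advances works for the last word too.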