Tags: python, pdfminer, pdf-scraping, pdfminersix

Pdfminer randomly changes text size when converting pdf to html


[Image: an example of the type of PDF I'm trying to scrape]

I'm trying to scrape a PDF document to count the number of papers it lists. The names of the papers are set in a specific font and size (10px).

Other elements of the PDF use the same font and size but are not paper names, so my solution was to count papers by checking whether the text contains at least one hyphen. However, pdf2txt.py for some reason changes the size of the third line of text in the PDF, which prevents that paper from being counted.

In the image attached, this happens at the bottom of the page where "University - Liquidity spillover ...Market" is in font size 9, while the rest of the text is in font size 10.

Why is it doing this, and how can I prevent pdfminer from changing the size of the text randomly?

This is the command I used to convert to HTML:

pdf2txt.py -o output.html -t html input.pdf

Solution

  • It does not matter whether the conversion to PDF went through OCR or HTML; the two behave differently but give similar output, and much the same issues arise when working in reverse. As far as I know from its samples, PDFMiner is hard-coded to one fixed scale for conversion, but most documents are not fixed that way.

    Whenever a source is not defined in points, such as scanner pixels (px), its sizes need to be rounded to PDF units, which is often described as the closest pt size.

    Without the PDF to test, here is a different interpretation of that area: the top line is seen rounded off as 16 pt (actual scalar units = 66.6984), and the green and blue lines as 17 pt (actual scalar units = 70.8671) and 17 pt (unchanged).


    Hence, to aid recognition in a conversion, the source units should first be adapted to the nearest 1/2 point (closest 10 twips).
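
The rounding to half points can be sketched in Python. This is only an illustration, not part of pdfminer: the helper names `snap_to_half_point` and `points_to_twips` are made up here, and the one fixed fact used is that 1 pt = 20 twips, so half a point is 10 twips.

```python
TWIPS_PER_POINT = 20  # 1 pt = 20 twips, so 0.5 pt = 10 twips


def snap_to_half_point(size_pt: float) -> float:
    """Round a size in points to the nearest half point (hypothetical helper)."""
    return round(size_pt * 2) / 2


def points_to_twips(size_pt: float) -> int:
    """Express the half-point-snapped size in twips (a multiple of 10)."""
    return int(snap_to_half_point(size_pt) * TWIPS_PER_POINT)


if __name__ == "__main__":
    # Made-up sizes, chosen away from exact ties (Python rounds ties to even).
    for raw in (9.8, 10.1, 9.3):
        print(raw, "->", snap_to_half_point(raw), "pt =", points_to_twips(raw), "twips")
```

With this kind of snapping, sizes of 9.8 and 10.1 both land on 10 pt, so two lines that differ only by a small scaling error would be treated as the same size.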

    Answer

    Without the input and output to test I cannot give the exact reason, but the following is a commonly found cause.

    The point sizes reported are indicative and usually rounded by the reader (here 16 and 17), because the PDF does not use points but variable scalar units (here 66.6984 and 70.8671).

    Since a PDF has no concept of lines coming from one source, every consecutive line can be a different height, or even contain text of fluctuating heights (desirable for maths equations).

    To control output heights, they should ideally be defined per line as "Point heights" in the source.
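
When the source cannot be changed, one workaround on the scraping side is to match sizes with a tolerance instead of testing for exactly 10. The sketch below is an assumption-laden illustration: it assumes pdfminer.six is installed and uses its `extract_pages` layout API and the `size` attribute of `LTChar`; the function names, the target of 10, the tolerance of 1.0, and the sample lines are all made up and not taken from the question's PDF.

```python
from statistics import median


def count_paper_lines(lines, target_size=10.0, tolerance=1.0):
    """Count lines whose size is close to target_size and that contain a hyphen.

    `lines` is an iterable of (text, size) pairs.
    """
    return sum(
        1
        for text, size in lines
        if abs(size - target_size) <= tolerance and "-" in text
    )


def iter_lines_with_size(pdf_path):
    """Yield (text, median character size) for each text line of a PDF."""
    # Imported lazily so count_paper_lines works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    yield line.get_text().strip(), median(sizes)


if __name__ == "__main__":
    # Hypothetical (text, size) pairs standing in for extracted lines.
    sample = [
        ("Some University - First paper title", 9.4),   # reported slightly small
        ("Other University - Second paper title", 10.0),
        ("Session chair", 10.0),                         # no hyphen
        ("Heading - larger text", 14.0),                 # wrong size
    ]
    print(count_paper_lines(sample))
```

On a real file the call would look like `count_paper_lines(iter_lines_with_size("input.pdf"))`. The size pdfminer reports for a character may not equal the nominal font size, so the tolerance would need tuning against the actual document.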

    Pdfminer should convert a 10 pt object to a 13.333 px equivalent. Yet in its own simple samples, a PDF font of 24 page units is output as HTML text rounded off to 27px (by my calculation it should have been 32px). Both figures rest on the assumption that no other scalars are involved.
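
The arithmetic behind those figures can be checked directly. This sketch assumes the usual CSS convention of 96 px per inch against 72 pt per inch and, as in the paragraph above, that no other scale factor is applied; `pt_to_px` is an illustrative helper, not a pdfminer function.

```python
def pt_to_px(size_pt: float, px_per_inch: float = 96.0) -> float:
    """Convert points (1/72 inch) to CSS pixels."""
    return size_pt * px_per_inch / 72.0


if __name__ == "__main__":
    print(round(pt_to_px(10), 3))  # 13.333
    print(pt_to_px(24))            # 32.0
    # Ratio implied by the sample output mentioned above (27 px for 24 units).
    print(27 / 24)                 # 1.125
```

The observed ratio of 1.125 differs from the expected 96/72 ≈ 1.333, which is consistent with some other scalar being involved in the HTML output.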