How to count characters by font?


For every page in a given PDF file it's possible to list the fonts used:

$ pdffonts -f 10 -l 10 file.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      12  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes     95  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     101  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    106  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     112  0
Arial                                TrueType          WinAnsi          yes no  no     121  0

I need to identify likely problematic fonts based on the pdffonts output and count the characters that use each font. I did this with the following snippet:

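from collections import Counter

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer


def count_fonts_ocurrencies_by_page(pdf_filepath):
    # extract_pages() yields one layout per page; next() takes only the first.
    page_layout = next(extract_pages(pdf_filepath))

    fonts = []

    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    # Skip LTAnno items (inserted spaces/newlines), which carry no font.
                    if isinstance(character, LTChar):
                        fonts.append(character.fontname)

    return Counter(fonts)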

I'm looking for a straightforward way to do the same, or something close: I only need a rough percentage of font usage on a single PDF page. Ideally this would avoid iterating over every character, or avoid pulling in a whole module like pdfminer for one function applied to one page at a time. It would also help if I could do something similar while reusing as little pdfminer code as possible, since it is built in a modular way.
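
For reference, the percentage itself is a small step once per-font counts exist, for example the Counter returned by the snippet above. A minimal sketch; the helper name `font_usage_percentages` and the sample counts are made up for illustration:

```python
from collections import Counter


def font_usage_percentages(font_counts):
    """Return each font's share of all counted characters, in percent."""
    total = sum(font_counts.values())
    if total == 0:
        # No characters counted (e.g. a page with only images).
        return {}
    return {font: 100.0 * n / total for font, n in font_counts.items()}


# Hypothetical counts, for illustration only.
example = Counter({"DIIDPF+ArialMT": 750, "Arial": 250})
print(font_usage_percentages(example))
```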


Solution

  • You could try pdftohtml, which ships in the same package as pdffonts, and then parse the resulting HTML file with XPath, using the CSS classes to tell the fonts apart:

    pdftohtml -f 1 -l 1 -c -s -i -fontfullname fonts.pdf
    

    Generated document:

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
    <head>
    <title>fonts-html.html</title>
    
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
     <br/>
    <style type="text/css">
    <!--
        p {margin: 0; padding: 0;}  .ft10{font-size:16px;font-family:BAAAAA+NotoSans-CondensedExtraBold;color:#000000;}
        .ft11{font-size:16px;font-family:CAAAAA+DejaVuMathTeXGyre-Regular;color:#000000;}
        .ft12{font-size:13px;font-family:DAAAAA+Baekmuk-Headline;color:#000000;}
        .ft13{font-size:13px;font-family:EAAAAA+LMMono9-Regular;color:#000000;}
        .ft14{font-size:13px;font-family:FAAAAA+CantarellRegular;color:#000000;}
        .ft15{font-size:13px;font-family:GAAAAA+Courier;color:#000000;}
    -->
    </style>
    </head>
    <body bgcolor="#A0A0A0" vlink="blue" link="blue">
    <div id="page1-div" style="position:relative;width:892px;height:1263px;">
    <img width="892" height="1263" src="fonts001.png" alt="background image"/>
    <p style="position:absolute;top:64px;left:86px;white-space:nowrap" class="ft10"><b>Font1</b></p>
    <p style="position:absolute;top:91px;left:86px;white-space:nowrap" class="ft11">font3</p>
    <p style="position:absolute;top:109px;left:86px;white-space:nowrap" class="ft12">font4</p>
    <p style="position:absolute;top:124px;left:86px;white-space:nowrap" class="ft13">font5</p>
    <p style="position:absolute;top:144px;left:86px;white-space:nowrap" class="ft14">font6</p>
    <p style="position:absolute;top:163px;left:86px;white-space:nowrap" class="ft15">font7</p>
    </div>
    </body>
    </html>
    

    Parsing the HTML with Python and counting characters by font (class attribute):

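    from lxml import html

    tree = html.parse(r'/home/luis/tmp/fonts-html.html')

    # Select the paragraphs styled with class ft10 and measure the first one's text.
    eleList = tree.xpath("//p[@class='ft10']")
    len(eleList[0].text_content())
    # text length: 5

    # Select every paragraph whose class contains 'ft' and read the class name.
    eleList = tree.xpath("//p[@class[contains(.,'ft')]]")
    eleList[0].get('class')
    # class name: 'ft10'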
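
The two steps above (find the `ftNN` class of each paragraph, then measure its text) can be combined into one pass that reports counts per font name. What follows is a minimal sketch under stated assumptions, not a definitive implementation. It swaps lxml for the standard library's `html.parser` and `re` so that it runs without third-party packages. The names `CSS_RULE`, `FontTextCounter`, and `count_chars_by_font` are made up for illustration. It also assumes the layout shown in the generated document: every text run is a `<p>` carrying a single `ftNN` class, and each class declares its `font-family` in the `<style>` block.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Matches rules such as ".ft10{font-size:16px;font-family:Some+Font;color:#000000;}"
# and captures the class name and the font-family value.
CSS_RULE = re.compile(r"\.(ft\d+)\s*\{[^}]*?font-family:\s*([^;}]+)")


class FontTextCounter(HTMLParser):
    """Sums the length of the text found inside <p class="..."> elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._current_class = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "p":
            self._current_class = None

    def handle_data(self, data):
        # Text outside a classed <p> (including the <style> body) is ignored.
        if self._current_class:
            self.counts[self._current_class] += len(data)


def count_chars_by_font(html_text):
    """Return a Counter mapping font-family name to character count."""
    class_to_font = {
        css_class: font.strip() for css_class, font in CSS_RULE.findall(html_text)
    }
    parser = FontTextCounter()
    parser.feed(html_text)
    parser.close()

    by_font = Counter()
    for css_class, n in parser.counts.items():
        # Fall back to the class name if no matching CSS rule was found.
        by_font[class_to_font.get(css_class, css_class)] += n
    return by_font


# Small sample in the shape of the generated document above.
sample = """<style type="text/css">
<!--
    .ft10{font-size:16px;font-family:BAAAAA+NotoSans-CondensedExtraBold;color:#000000;}
    .ft11{font-size:16px;font-family:CAAAAA+DejaVuMathTeXGyre-Regular;color:#000000;}
-->
</style>
<p class="ft10"><b>Font1</b></p>
<p class="ft11">font3</p>"""
print(count_chars_by_font(sample))
```

Counts include spaces, as `len(text_content())` does above. For real output, read the file written by pdftohtml and pass its contents to `count_chars_by_font`.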