Tags: python, pdf, pdfminer

How can I extract font color of text within a PDF in Python with PDFMiner?


How can I extract font color from text within a PDF?

I have already tried exploring the LTText and LTChar objects using PDFMiner, but it seems that this module only allows extracting font size and style, not color.


Solution

  • PDFMiner's LTChar object has a 'graphicstate' attribute, which in turn has 'scolor' (stroking color) and 'ncolor' (non-stroking color) attributes; these can be used to obtain text color information. Text is normally filled rather than stroked, so 'ncolor' is usually the one that holds the visible text color. Here's a working code snippet (based on the code from one of the answers) that outputs the font info for each text element:

    import sys

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    # Usage: python script.py input.pdf
    with open(sys.argv[1], 'rb') as src_file:
        for page_layout in extract_pages(src_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    # Distinct font names, sizes, and colors seen in this text element
                    fontinfo = set()
                    for text_line in element:
                        for character in text_line:
                            # Only LTChar carries font data; LTAnno (inserted spaces/newlines) is skipped
                            if isinstance(character, LTChar):
                                fontinfo.add(character.fontname)
                                fontinfo.add(character.size)
                                fontinfo.add(character.graphicstate.scolor)
                                fontinfo.add(character.graphicstate.ncolor)
                    print("\n", element.get_text(), fontinfo)
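
If you need to know which text has which color, rather than just the set of colors per element, the same attributes can be grouped into runs. The following is only a minimal sketch, not part of the original answer: `normalize_color`, `color_runs`, and `iter_char_colors` are hypothetical helper names. It assumes that 'ncolor' may be None, a single number (gray), or a sequence of components (e.g. RGB or CMYK), and that in some pdfminer.six versions the sequence may be a list, which cannot be put in a set or used as a dict key until it is converted to a tuple.

```python
from itertools import groupby


def normalize_color(color):
    """Return a hashable tuple for a color value, or None if no color is set."""
    if color is None:
        return None
    if isinstance(color, (list, tuple)):
        # Keep numeric components; stringify anything else (assumed: e.g. pattern names)
        return tuple(c if isinstance(c, (int, float)) else str(c) for c in color)
    # A bare number is treated as a single gray level
    return (color,) if isinstance(color, (int, float)) else (str(color),)


def color_runs(pairs):
    """Merge consecutive (text, color) pairs with equal color into (color, text) runs."""
    keyed = ((normalize_color(color), text) for text, color in pairs)
    return [
        (color, "".join(text for _, text in group))
        for color, group in groupby(keyed, key=lambda item: item[0])
    ]


def iter_char_colors(path):
    """Yield (text, ncolor) for every LTChar in the PDF at `path`."""
    # Imported here so the helpers above can be used without pdfminer.six installed
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if isinstance(text_line, LTTextLine):
                        for obj in text_line:
                            if isinstance(obj, LTChar):
                                yield obj.get_text(), obj.graphicstate.ncolor
```

Calling `color_runs(iter_char_colors("input.pdf"))` should then give a list of (color, text) pairs in reading order. The sketch uses the non-stroking color only; swap in 'scolor' if the document draws its text with stroked outlines.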