Extract only text of a certain font type from a .docx file using Python


I want to extract all text set in the Arial font from my .docx file. This is what I have so far, but it produces no output, even though most of the text in the document is in Arial.

from docx import *

document = Document('word.docx')
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.style == "Arial":
            print(run.text)

Solution

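  • A style is distinct from a font. In Word, a style is a named set of formatting (like 'Heading 1') that is applied to text. Sadly, most people don't use styles; they just select a region of text and change its font. In python-docx, run.style returns a style object, so comparing it to the string "Arial" never matches, which is why your loop prints nothing.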

    From a bit of digging with a test document, it seems that the font attribute is what you are looking for. For a document like this:

    Default Font.
    
    Arial Font.
    
    Default Arial Default Courier.
    

    I can find the non-default font sections:

    >>> from docx import Document
    >>> from itertools import chain
    
    >>> doc = Document("test.docx")
    >>> runs = list(chain.from_iterable(list(p.runs) for p in doc.paragraphs))
    >>> [r.font.name for r in runs]
    [None, None, 'Arial', None, None, 'Arial', None, 'Courier New', None]
    >>> [r.text for r in runs]
    ['Default Font.', '', 'Arial Font.\n', '\n', 'Default ', 'Arial', ' Default ', 'Courier.', '']
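
Building on the font attribute shown above, here is a minimal sketch of how the original loop could filter on it. The helper names (effective_font_name, runs_with_font, arial_text_from_file) are made up for illustration and are not part of python-docx. Since run.font.name is None whenever the font is inherited rather than set directly, the sketch also checks the run's character style and the paragraph's style. It does not follow base styles, document defaults, or theme fonts, so treat it as a starting point rather than a complete resolver.

```python
def effective_font_name(run, paragraph):
    """Return the run's font name, falling back to style fonts when unset."""
    # run.font.name is None when the font was not applied directly to the run.
    if run.font.name is not None:
        return run.font.name
    # Assumption: checking the character style, then the paragraph style, is
    # enough for this document. Base styles and defaults are NOT resolved.
    for style in (run.style, paragraph.style):
        if style is not None and style.font.name is not None:
            return style.font.name
    return None


def runs_with_font(document, font_name):
    """Yield the text of each non-empty run whose font matches font_name."""
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            if run.text and effective_font_name(run, paragraph) == font_name:
                yield run.text


def arial_text_from_file(path):
    """Hypothetical convenience wrapper: list all Arial text in a .docx file."""
    # Imported here so the helpers above can be tried without python-docx.
    from docx import Document
    return list(runs_with_font(Document(path), "Arial"))
```

Calling arial_text_from_file("word.docx") should then return the Arial runs as a list of strings. If most of the document is Arial only because the default style says so, every run will report None for run.font.name, and the style fallback (or a check of the document's default font) is what makes the difference.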