I want to extract all text set in the 'Arial' font from my .docx file. This is what I have so far, but it produces no output, even though most of the text in the document is 'Arial'.
from docx import *
document = Document('word.docx')
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.style == "Arial":
            print(run.text)
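Style is distinct from font in documents. A style in Word is a named set of formatting (like 'heading') applied to text, and run.style returns a style object rather than a font name, so comparing it to "Arial" never matches. Sadly, most people don't use styles; they just select a region of text and change the font.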
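To see the difference on your own file, something like the following sketch may help. It lists each run's text next to its style name and its explicitly set font name. The helper name describe_runs is made up for illustration; it assumes it is handed a python-docx Document (or anything with the same paragraphs/runs shape).

```python
def describe_runs(document):
    """Return (text, style name, explicit font name) for every run.

    `document` is assumed to be a python-docx Document; describe_runs is a
    hypothetical helper, not part of the library.
    """
    rows = []
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            # run.style is a style object, so read its .name;
            # run.font.name is the directly applied font, or None if inherited.
            rows.append((run.text, run.style.name, run.font.name))
    return rows
```

Calling describe_runs(Document('word.docx')) and printing the rows should make it obvious which column actually contains 'Arial'.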
From a bit of digging with a test document, it seems that the run's font attribute (specifically run.font.name) is what you are looking for. For a document like this:
Default Font.
Arial Font.
Default Arial Default Courier.
I can find the sections with a non-default font (runs that inherit their font report None):
>>> from docx import Document
>>> from itertools import chain
>>> doc = Document("test.docx")
>>> runs = list(chain.from_iterable(p.runs for p in doc.paragraphs))
>>> [r.font.name for r in runs]
[None, None, 'Arial', None, None, 'Arial', None, 'Courier New', None]
>>> [r.text for r in runs]
['Default Font.', '', 'Arial Font.\n', '\n', 'Default ', 'Arial', ' Default ', 'Courier.', '']