Extract positions of bold words with Python


I would like to extract the position of bold words detected in a .docx file.

For that, I have used the docx library, and it successfully detects the words in bold. However, extracting only the word is not very useful, since the same word may also appear elsewhere in a different format.

For example:

Let's assume that my file.docx contains: "My cat is not a normal cat", where only the first "cat" is bold.

from docx import Document

document = Document('/path/to/file.docx')

def bold(document):
    # Collect the text of every bold run across all paragraphs.
    list_bolds = []
    for para in document.paragraphs:
        for run in para.runs:
            if run.bold:
                print(run.text)
                list_bolds.append(run.text)
    return list_bolds

This function would give me the word "cat" as output. However, if I then use that output to filter the bold words out of my text, I would also eliminate the second "cat", which is not bold.

Any idea how to get only the position of this word? For example, to obtain 2 as the word position.

Thank you all!


Solution

  • I don't know the docx library well, but just from looking at the code, maybe change it to return a list of booleans, one per run?

    from docx import Document

    document = Document('/path/to/file.docx')

    def get_bold_list(para):
        # One entry per run; run.bold is True, False, or None (inherited from the style).
        bold_list = []
        for run in para.runs:
            bold_list.append(run.bold)
        return bold_list

    for para in document.paragraphs:
        bold_list = get_bold_list(para)
        # do something with bold_list
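
To get from a per-run list to the word position the question asks for, one possible approach is to walk the runs character by character while counting words. The sketch below is only an illustration under a few assumptions: words are separated by whitespace, positions are 1-based within a paragraph, and a word counts as bold if any of its characters sits in a run whose bold is True. The helper name bold_word_positions and the FakeRun stand-in are hypothetical and not part of python-docx; the function only relies on each run having .text and .bold, which python-docx runs provide.

```python
from collections import namedtuple


def bold_word_positions(runs):
    """Return 1-based positions of words that contain bold text.

    `runs` is any iterable of objects with `.text` and `.bold`,
    for example `paragraph.runs` from python-docx.
    """
    positions = []
    index = 0        # position of the word currently being read
    in_word = False  # True while we are inside a word
    for run in runs:
        for ch in run.text:
            if ch.isspace():
                in_word = False
                continue
            if not in_word:
                # First character of a new word, even if the word
                # started in a previous run and continues here.
                in_word = True
                index += 1
            # run.bold may be None (inherited); only explicit True counts here.
            if run.bold is True and (not positions or positions[-1] != index):
                positions.append(index)
    return positions


# Hypothetical stand-in for python-docx runs, so the sketch runs without a file.
FakeRun = namedtuple("FakeRun", ["text", "bold"])

example_runs = [
    FakeRun("My ", None),
    FakeRun("cat", True),
    FakeRun(" is not a normal cat", None),
]
print(bold_word_positions(example_runs))
```

With a real document, the same function could be called as bold_word_positions(para.runs) for each para in document.paragraphs. Because it counts characters rather than splitting each run separately, a word that Word has split across several runs is still counted once. Bold that comes from a paragraph or character style shows up as None on the run, so this sketch would not detect it.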