
How to use python-docx to replace text in a Word document and save


The oodocx module mentioned on the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find on Stack Overflow on the subject, so please believe that I have done my “homework”.

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.

Task: Open an MS Word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word in Dictionary that occurs in that line of text with its dictionary value. Then save and close the document, keeping everything else the same.

Line of text (for example) “We shall linger in the chambers of the sea.”

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

I am not seeing anything in the documentation that allows me to do this—maybe it is there but I don’t get it because everything is not spelled-out at my level.

I have followed other suggestions on this site and tried an earlier version of the module (https://github.com/mikemaccana/python-docx), which is supposed to have "methods like replace, advReplace". To do so, I opened the source code in the Python interpreter and added the following at the end (to avoid clashes with the already installed version 0.7.2):

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None) 

Running this produces the following error message:

NameError: name 'coreprops' is not defined

Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.

If this matters, I am using the 64-bit version of Enthought's Canopy on OS X 10.9.3.


Solution

  • UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

    1. This one will replace a regex match with a replacement string. The replacement string will appear formatted the same as the first character of the matched string.
    2. This one will isolate a run so that formatting can be applied to that word or phrase, such as highlighting each occurrence of "foobar" in the text, or making it bold or setting it in a larger font.

    The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

    Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

    for paragraph in document.paragraphs:
        if 'sea' in paragraph.text:
            print(paragraph.text)
            paragraph.text = 'new text containing ocean'
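
The loop above can also be driven by a dictionary, as in the question. What follows is only a minimal sketch, not something python-docx provides: `replace_in_paragraphs` and `replace_in_file` are hypothetical helper names, and the code assumes only that each paragraph has a readable and writable `.text` attribute, as python-docx paragraphs do. It rewrites the whole paragraph text, just like the loop above.

```python
def replace_in_paragraphs(paragraphs, mapping):
    """Replace every key of `mapping` with its value in each paragraph.

    Returns the number of paragraphs whose text changed.
    """
    changed = 0
    for paragraph in paragraphs:
        original = paragraph.text
        updated = original
        for old, new in mapping.items():
            updated = updated.replace(old, new)
        if updated != original:
            # Assigning to .text replaces the entire paragraph content.
            paragraph.text = updated
            changed += 1
    return changed


def replace_in_file(path, mapping):
    """Hypothetical wrapper: open a .docx, replace in body paragraphs, save."""
    from docx import Document  # python-docx, imported lazily

    document = Document(path)
    replace_in_paragraphs(document.paragraphs, mapping)
    document.save(path)
```

Note that `str.replace` matches plain substrings, so the key 'sea' would also change 'season' to 'oceanson'. If whole-word matching matters, a regular expression with word boundaries is probably the safer choice.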
    

    To search in tables as well, you would need to use something like:

    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if 'sea' in paragraph.text:
                        paragraph.text = paragraph.text.replace("sea", "ocean")
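
If both the body text and the tables need the same treatment, the two loops can be folded into one generator. This is a sketch under assumptions: `iter_paragraphs` and `replace_everywhere` are hypothetical names, and the code relies only on the `paragraphs`, `tables`, `rows` and `cells` attributes already used above. It does not descend into nested tables or into headers and footers.

```python
def iter_paragraphs(document):
    """Yield body paragraphs first, then paragraphs in top-level table cells."""
    for paragraph in document.paragraphs:
        yield paragraph
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield paragraph


def replace_everywhere(document, old, new):
    """Replace `old` with `new` in every paragraph the generator yields."""
    for paragraph in iter_paragraphs(document):
        if old in paragraph.text:
            # Whole-paragraph assignment, so run-level formatting is dropped.
            paragraph.text = paragraph.text.replace(old, new)
```

With a document opened as in the question, the call would be `replace_everywhere(document, 'sea', 'ocean')` followed by `document.save(...)`.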
    

    If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
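
One way to keep most of the formatting is to replace inside each run rather than on the paragraph as a whole. The sketch below uses the `runs` attribute of a paragraph and the `text` attribute of a run, both of which python-docx exposes; `replace_in_runs` itself is a hypothetical helper name. The important assumption is that the search text sits entirely inside a single run. Word often splits a word or phrase across several runs, and this sketch will simply miss those cases.

```python
def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run of `paragraph`.

    Each run keeps its own character formatting because only its text
    changes. Returns the number of runs that were modified.
    """
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            run.text = run.text.replace(old, new)
            changed += 1
    return changed
```

A return value of 0 for a paragraph whose `paragraph.text` does contain the search text is a sign that the match spans more than one run and needs the more involved handling described in the update at the top of this answer.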

    By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.