Search code examples
pythonms-wordopenxmldocx

How can I search a word in a Word 2007 .docx file?


I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase" that could/would be found from a search within Word.

Is there a way from Python to see the text? I have no interest in formatting - I just want to classify documents as having or not having "some special phrase".


Solution

  • More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
    I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.