Search code examples
pythonxmlms-worddocxdoc

How to get XML from DOC (not DOCX)?


For a DOCX document I do:

document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How to do this for DOC document?


Solution

  • You don't.

    DOCX are tough enough to process, and they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.

    Don't try to process DOC files directly. Convert them to DOCX first.

    See: