For a DOCX document I do:
import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')
How can I do the same for a DOC document?
DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.
Don't try to process DOC files directly. Convert them to DOCX first.
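One possible way to do the conversion, sketched here under the assumption that LibreOffice is installed and its `soffice` binary is on your PATH, is to call it in headless mode and then read the resulting DOCX the same way as in the question. The helper names (`build_convert_command`, `convert_doc_to_docx`, `read_document_xml`) are made up for illustration and are not part of any library.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts doc_path to DOCX.

    Assumption: LibreOffice is installed and `soffice` is the binary name.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a DOC file to DOCX and return the expected output path."""
    doc_path, out_dir = Path(doc_path), Path(out_dir)
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice writes <original stem>.docx into out_dir.
    return out_dir / (doc_path.stem + ".docx")


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a DOCX file."""
    with zipfile.ZipFile(docx_path) as document:
        return document.read("word/document.xml")
```

Usage would look something like `xml = read_document_xml(convert_doc_to_docx("report.doc", "converted"))`, after which `xml` can be handed to BeautifulSoup exactly as in the question. The file and directory names here are placeholders.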
See: