I'm trying to parse a docx file. I unzipped it first, then tried to read the Document.xml file using with open(..),
but it raises the error "'charmap' codec can't decode byte 0x98 in position 7618: character maps to <undefined>". The XML is encoded as UTF-8.
I wrote the following code:
with open(self.tempDir + self.CONFIG['main_xml']) as xml_file:
    self.dom_xml = etree.parse(xml_file)
I tried to force the encoding to UTF-8, but then I can't read the result correctly with etree.fromstring(..).
Please help me. How do I read the XML file correctly? Thanks.
This works without errors on your file:
import zipfile
import xml.etree.ElementTree as ET
zipfile.ZipFile('file.docx').extractall()
root = ET.parse('word/document.xml').getroot()
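
The likely cause of the original error is that open(..) without an encoding argument opens the file in text mode and decodes it with the platform's default encoding (a Windows code page, judging by the 'charmap' message) rather than UTF-8. Giving the parser a path or a binary stream lets it use the encoding declared in the XML itself. As a minimal sketch of the same idea without extracting the archive to disk (read_document_xml is a hypothetical helper name, and the 'word/document.xml' member path is taken from the snippet above):

```python
import zipfile
import xml.etree.ElementTree as ET


def read_document_xml(docx_path):
    """Parse word/document.xml straight from the .docx archive.

    Hypothetical helper for illustration. The member is opened as a
    binary stream, so the XML parser handles the decoding itself.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # ZipFile.open() returns a binary file-like object.
        with archive.open('word/document.xml') as xml_file:
            return ET.parse(xml_file).getroot()
```

If you prefer to keep your own code and the extracted file, opening it in binary mode, open(path, 'rb'), before passing it to etree.parse should avoid the decoding step in the same way.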