Tags: python, encoding, utf-8, lxml, decoding

Docx (XML) parsing error in Python: 'charmap' codec can't decode byte 0x98 in position 7618: character maps to <undefined>


I'm trying to parse a docx file. I unzipped it first, then tried to read the Document.xml file using with open(..), and it raises the error "'charmap' codec can't decode byte 0x98 in position 7618: character maps to <undefined>". The XML file declares UTF-8 encoding.

I wrote the following code:

        with open(self.tempDir + self.CONFIG['main_xml']) as xml_file:
            self.dom_xml = etree.parse(xml_file)

I tried to force the encoding to UTF-8, but then etree.fromstring(..) can't read the result correctly.

The character at position 7618 (from the error message) is: [image not shown]

Please help me. How do I read the XML file correctly? Thanks.


Solution

  • This works without errors on your file:

    import zipfile
    import xml.etree.ElementTree as ET
    
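    # Extract the archive, then let the parser open the XML itself so it
    # honours the encoding declared in the file (UTF-8).
    with zipfile.ZipFile('file.docx') as docx:
        docx.extractall()
    root = ET.parse('word/document.xml').getroot()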
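
    A likely cause, assuming a Windows machine whose default code page is cp1251: open() without an encoding argument decodes the file with the locale's code page, and byte 0x98 has no mapping in cp1251. The failure happens before the parser ever sees the UTF-8 declaration. Below is a minimal sketch that reproduces the error and then avoids it by handing the parser raw bytes straight from the archive. The helpers make_sample_docx and read_document_xml are hypothetical names, and the sample document is made up for illustration; it is not your file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_docx() -> bytes:
    """Build a tiny stand-in for a .docx (hypothetical sample data):
    a zip archive holding a UTF-8 encoded word/document.xml."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        # U+0418 encodes to the bytes D0 98 in UTF-8, so the file contains 0x98
        '<w:t>\u0418\u043c\u044f</w:t>'
        '</w:r></w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml.encode("utf-8"))
    return buf.getvalue()


def read_document_xml(docx_file) -> ET.Element:
    """Parse word/document.xml directly from the archive as a binary
    stream, so the XML parser does the decoding itself."""
    with zipfile.ZipFile(docx_file) as zf:
        with zf.open("word/document.xml") as xml_file:
            return ET.parse(xml_file).getroot()


sample = make_sample_docx()

# Reproduce the problem: decoding the UTF-8 bytes with cp1251 fails on 0x98.
with zipfile.ZipFile(io.BytesIO(sample)) as zf:
    raw = zf.read("word/document.xml")
try:
    raw.decode("cp1251")
    legacy_error = None
except UnicodeDecodeError as exc:
    legacy_error = str(exc)

# The fix: give the parser bytes and let it follow the XML declaration.
root = read_document_xml(io.BytesIO(sample))
texts = [t.text for t in root.iter(f"{{{W_NS}}}t")]

print(legacy_error)
print(texts)
```

    The same idea should carry over to lxml: pass the file path to etree.parse, or open the file with mode 'rb'. Note that lxml's etree.fromstring rejects a str that carries an encoding declaration, so give it bytes rather than decoded text.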