python xml ascii lxml parsexml

UnicodeDecodeError when parsing XML on Mac but works on PC


When parsing an XML file with:

from lxml import etree

with open('cortex_full.xml', 'r') as infile:
    root = etree.parse(infile)

I am getting the UnicodeDecodeError below. This only happens on my Mac though - if I parse the same file with the same script on my work PC, everything works fine.

File "/Users/Desktop/CPET/xml_test2.py", line 5, in <module>
    root = etree.parse(infile)
  File "src/lxml/lxml.etree.pyx", line 3442, in lxml.etree.parse (src/lxml/lxml.etree.c:81701)
  File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118888)
  File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:119171)
  File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:117959)
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:112686)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105881)
  File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107548)
  File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:12152)
  File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/lxml.etree.c:103210)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 783: ordinal not in range(128)

This seems to be a common problem, given the number of threads on here, but none of the suggested fixes work in this case. Any ideas for getting it to work? Full XML file here


Solution

  • Posting an answer that worked for me for future reference. Credit goes to @Burhan Khalid for the answer.

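    The encoding needs to be set to UTF-8 when opening the XML file. Without an explicit encoding, open() in text mode uses the platform's locale default. On the Mac that default was ASCII (note the encodings/ascii.py frame in the traceback), so decoding fails at the first non-ASCII byte (0xc2). The work PC presumably defaults to an encoding that accepts that byte, which would explain why the same script runs there.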

    with open('cortex_full.xml', 'r', encoding='utf-8') as infile:
        root = etree.parse(infile)
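
For anyone who wants to check the cause on their own machine, here is a minimal sketch using only the standard library. It is not the original file: SAMPLE, read_text and demo are made-up names for illustration, and the degree sign is just one example of a character whose UTF-8 encoding starts with byte 0xc2. It writes a small UTF-8 document to a temporary file, shows that decoding it as ASCII fails, and that decoding it as UTF-8 succeeds.

```python
import locale
import os
import tempfile

# Hypothetical sample document (not the original cortex_full.xml).
# U+00B0 (degree sign) is encoded in UTF-8 as the two bytes 0xC2 0xB0.
SAMPLE = '<?xml version="1.0" encoding="UTF-8"?>\n<reading unit="\u00b0C">37</reading>\n'


def read_text(path, encoding):
    """Read the whole file as text using an explicit encoding."""
    with open(path, 'r', encoding=encoding) as infile:
        return infile.read()


def demo():
    """Return (ascii_failed, utf8_text) for a temporary UTF-8 file."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    with os.fdopen(fd, 'wb') as out:
        out.write(SAMPLE.encode('utf-8'))
    try:
        try:
            read_text(path, 'ascii')
            ascii_failed = False
        except UnicodeDecodeError:
            # Same error class as in the traceback in the question.
            ascii_failed = True
        utf8_text = read_text(path, 'utf-8')
    finally:
        os.remove(path)
    return ascii_failed, utf8_text


# The locale's preferred encoding, which open() typically falls back to
# when no encoding argument is given (the value is platform dependent).
print(locale.getpreferredencoding(False))

ascii_failed, utf8_text = demo()
print(ascii_failed)
```

If the first printed value differs between the two machines, that would be consistent with the script behaving differently on them. Passing encoding explicitly, as in the answer above, removes that dependency.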