python, xml, unicode, expat-parser

Python expat can't parse an XML file with bad characters. How to work around it?


I'm trying to parse an XML file (OSM data) with expat, and there are lines with some Unicode characters that expat can't parse:

<tag k="name"
v="абвгдежзиклмнопр�?туфхцчшщьыъ�?ю�?�?БВГДЕЖЗИКЛМ�?ОПРСТУФХЦЧШЩЬЫЪЭЮЯ" />

<tag k="name" v="Cin\x8e? Rex" />

(XML file encoding in the opening line is "UTF-8")

The file is quite old, so it probably contains encoding errors. Modern files have no UTF-8 errors and parse fine. But if my program does meet a broken character, what workaround is there? Is it possible to chain the bz2 codec (I parse a compressed file) with the utf-8 codec so that broken characters are ignored or replaced with "?"?


Solution

  • I'm not sure whether the '�' characters were introduced by copy-pasting the string here, but if they are in the original data, the problem seems to lie with whatever generated the file, which introduced \uFFFD characters. That character is:

    "used to replace an incoming character whose value is unknown or unrepresentable in Unicode"

    quoted from: http://www.fileformat.info/info/unicode/char/fffd/index.htm

    Workaround? Just an idea to extend:

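    from xml.parsers import expat
    from xml.parsers.expat import errors

    # Assumes f (binary file object), xp (expat parser) and buf_size
    # are already set up by the caller.
    # xp.ErrorCode is numeric, so map the error names to their codes.
    # Invalid bytes are typically reported as XML_ERROR_INVALID_TOKEN.
    BAD_INPUT = (errors.codes[errors.XML_ERROR_BAD_CHAR_REF],
                 errors.codes[errors.XML_ERROR_INVALID_TOKEN])

    good = True
    buf = None
    while True:
        if good:
            buf = f.read(buf_size)
        else:
            # try again with cleaned buffer (cleaning is left as a
            # placeholder; the parser may need to be recreated after an error)
            pass
        try:
            xp.Parse(buf, len(buf) == 0)
            if len(buf) == 0:
                break
            good = True
        except expat.ExpatError:
            if xp.ErrorCode in BAD_INPUT:
                # look at xp.ErrorByteIndex (or nearby)
                # for 0xEF 0xBF 0xBD (UTF-8 replacement char) and remove it
                good = False
            else:
                # other errors processing; re-raise by default
                raise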
    

    Alternatively, clean the input buffer before parsing, taking care of corner cases (a partial multi-byte sequence at the end of a buffer). I can't recall whether Python's expat lets you assign a custom error handler; that would make this easier.

    If I remove the '�' characters from your sample, it parses fine. \xd1 does not fail.

    OSM data?
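
The buffer-cleaning alternative mentioned above could look roughly like the sketch below. It is not a tested solution for real OSM dumps, just one way to combine bz2 with a lenient UTF-8 decoder: each chunk goes through an incremental decoder with errors="replace", which turns invalid bytes into U+FFFD and holds back a partial multi-byte sequence at the end of a chunk until the next chunk arrives. The names parse_lenient and collect_names and the path argument are made up for illustration.

```python
import bz2
import codecs
from xml.parsers import expat


def parse_lenient(stream, parser, buf_size=64 * 1024):
    """Feed a binary stream to an expat parser, replacing invalid UTF-8."""
    # The incremental decoder buffers an incomplete multi-byte sequence
    # at the end of one chunk and completes it with the next chunk.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = stream.read(buf_size)
        final = not chunk
        text = decoder.decode(chunk, final)
        # Re-encode so the bytes still match the UTF-8 XML declaration.
        if text or final:
            parser.Parse(text.encode("utf-8"), final)
        if final:
            break


def collect_names(path):
    """Hypothetical usage: collect <tag k="name" v="..."/> values."""
    names = []

    def start(name, attrs):
        if name == "tag" and attrs.get("k") == "name":
            names.append(attrs["v"])

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    # bz2.open in binary mode decompresses transparently while reading.
    with bz2.open(path, "rb") as f:
        parse_lenient(f, parser)
    return names
```

This only repairs byte sequences that are not valid UTF-8. Characters that decode correctly but are not allowed in XML (most control characters, for example) would still make expat raise an error and would need separate filtering.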