python python-3.x unicode python-unicode

Unicode-escaped file processing error


I have a raw text file containing only the following line, and no newline:

Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439

The characters are escaped as shown above, meaning that a sequence such as \u0410 is really a backslash followed by five alphanumeric characters (the letter u and four hex digits), not a single Unicode character. I am trying to decode the file using the following code:

import codecs

with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(line)

Using print is not an option here; see the comments.

Running it gives me the following error:

Traceback (most recent call last):
  File "terms2.py", line 5, in <module>
    for line in input:
  File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape

What is going on?

I am running Python 3.5.1 on Windows 8.1, and the code seems to work for most other Unicode characters (this line is the first one to cause the crash).

See edit history for the original question.


Solution

  • It seems that the data handed to the decoder is truncated after the 72nd character (0-based index 71), which matches the end of the byte range reported in the error message. This appears to be related to a known bug in the codecs stream reader.

    The following code produces the same error as in your example:

    import codecs
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline()
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(72)
    

    Increasing the readline size above the actual size of the input or setting it to -1 eliminates the error:

    import codecs
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(1000)
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(-1)
    

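    Evidently, for line in input: fetches each line through readline() without a size argument, and the stream reader then reads and decodes the data in chunks of 72 characters. When a chunk boundary falls in the middle of a \uXXXX escape, the decoder sees a truncated escape and raises the error.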

    So here are a couple of workarounds:

    Workaround 1:

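    import codecs
    
    # Read the escaped text undecoded, then decode one complete line at a time.
    with open("wikidata-terms20.nt", 'r') as input:
        # Write UTF-8 explicitly: the platform default encoding may not be
        # able to represent the decoded characters.
        with open("wikidata-terms3.nt", "w", encoding='utf-8') as output:
            for line in input:
                output.write(codecs.decode(line, 'unicode_escape'))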
    

    Workaround 2:

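    import codecs
    
    with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
        # Write UTF-8 explicitly: the platform default encoding may not be
        # able to represent the decoded characters.
        with open("wikidata-terms3.nt", "w", encoding='utf-8') as output:
            # readlines() decodes the whole file in one go instead of in
            # 72-character chunks, so no escape is split.
            for line in input.readlines():
                output.write(line)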
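
The failure described above can be reproduced without the file. What follows is a minimal sketch, assuming a short sample of escaped bytes. The names escaped, whole and truncated_error are illustrative and do not come from the original code. Decoding the complete data succeeds, while data cut off in the middle of a \uXXXX escape, as can happen at a chunk boundary, raises a UnicodeDecodeError.

```python
import codecs

# Literal backslash-u sequences, as they are stored in the file (assumed sample).
escaped = b"Q853 \\u0410\\u043D\\u0434"

# Decoding the complete data works.
whole = codecs.decode(escaped, "unicode_escape")

# Dropping the last two bytes leaves "\u04", an incomplete escape,
# which is what the decoder sees when a chunk ends mid-escape.
try:
    codecs.decode(escaped[:-2], "unicode_escape")
    truncated_error = None
except UnicodeDecodeError as exc:
    truncated_error = exc.reason
```

Both workarounds avoid this situation because they only ever pass complete lines, or the complete file contents, to the decoder.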