python unicode-escapes

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 10752-10753: truncated \uXXXX escape


I am getting this error when I try to read my data:

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 10752-10753: truncated \uXXXX escape

I tried putting an r prefix in front of the string to turn it into a raw string, but it didn't work.

Any advice?

Reading the data:

import pandas as pd

pd.set_option('display.max_colwidth', 100)  # extend column display length to 100 characters
data = pd.read_csv(r'de_full_1.tsv', sep="\t", encoding="unicode_escape")
data.head(100)

The rows at the mentioned positions are:

10751  GerSenNeg429  negative  Im „Solar Valley“ geht die Sonne unter.
10752  GerSenNeg430  negative  Leere Hallen, tiefe Bunker
10753  GerSenNeg431  negative  Ein paar Topfpflanzen kümmern in der Zentralpforte der Hanwha-Q-Cells AG vor sich hin.
10754  GerSenNeg432  negative  Der Betonbau, der wirkt wie ein verglaster Bunker, ist Endstation für Anfragen.



Solution

  • I can't be entirely sure because you are not providing the contents of the file around the mentioned byte position, but I am assuming that the data is just regular text that uses the \ character freely.

    However, encoding="unicode_escape" tells the decoder that the file represents Unicode characters as \uXXXX sequences (e.g. \u03A8 for the character Ψ). If \u or \U appears anywhere without forming a valid Unicode escape sequence (for example in the string C:\Users\Somebody), you get this error.

    Your encoding should probably be a different one. It's hard to say which without seeing your file, but most likely it is utf_8, ascii or latin_1.
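To make the explanation above concrete, here is a minimal sketch using only the standard library. The first part reproduces the failure on the C:\Users\Somebody example. The second part is a hypothetical helper, first_working_encoding (not part of pandas or Python), that tries a few candidate encodings in order. The candidate list, including cp1252, is an assumption based on the German text in the question. Note that latin_1 accepts any byte sequence, so a successful decode does not prove the encoding is the right one.

```python
# A well-formed escape decodes fine: the six bytes \u03A8 become the character Ψ.
valid = b"\\u03A8".decode("unicode_escape")

# Ordinary text with a backslash fails: "\U" is not followed by 8 hex digits.
try:
    b"C:\\Users\\Somebody".decode("unicode_escape")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc


def first_working_encoding(raw, candidates=("utf_8", "cp1252", "latin_1")):
    """Hypothetical helper: return the first candidate that decodes `raw` without error.

    latin_1 never raises, so it is kept last as a fallback only.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Sample line from the question, encoded as UTF-8 for illustration.
sample = "Im „Solar Valley“ geht die Sonne unter.".encode("utf_8")

print(valid, decode_error)
print(first_working_encoding(sample))
```

With the real file, you could read its bytes with open('de_full_1.tsv', 'rb').read(), pass them to the helper, and then hand the resulting name to pd.read_csv(..., encoding=name). Checking that the umlauts and quotation marks display correctly afterwards is still worthwhile.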