I am getting this error when I try to read my data:
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 10752-10753: truncated \uXXXX escape
I tried putting an r before the string to make it a raw string, but it didn't work.
Any advice?
import pandas as pd
pd.set_option('display.max_colwidth', 100)  # extend column display width to 100 characters
data = pd.read_csv(r'de_full_1.tsv',sep="\t", encoding= "unicode_escape")
data.head(100)
The rows around the mentioned position are:
10751 GerSenNeg429 negative Im „Solar Valley“ geht die Sonne unter.
10752 GerSenNeg430 negative Leere Hallen, tiefe Bunker
10753 GerSenNeg431 negative Ein paar Topfpflanzen kümmern in der Zentralpforte der Hanwha-Q-Cells AG vor sich hin.
10754 GerSenNeg432 negative Der Betonbau, der wirkt wie ein verglaster Bunker, ist Endstation für Anfragen.
I can't be entirely sure because you haven't provided the contents of the file around the mentioned byte position, but I assume the data is just regular text that uses the \ character freely.
However, using encoding="unicode_escape" tells the reader that the file encodes Unicode characters as \uXXXX sequences (e.g. \u03A8 for the character Ψ), so if \u or \U appears in any other way that does not form a valid Unicode escape sequence (for example in the string C:\Users\Somebody), you get an error.
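To illustrate the point, here is a minimal sketch that decodes two byte strings with the unicode_escape codec. The variable names are made up for the example, and the second string is the C:\Users\Somebody case mentioned above, not something taken from your file.

```python
# A complete \uXXXX escape decodes to the character it names.
valid = rb"\u03A8"
print(valid.decode("unicode_escape"))  # prints the Greek letter Psi

# "\U" followed by something that is not a complete escape cannot be decoded.
broken = rb"C:\Users\Somebody"
try:
    broken.decode("unicode_escape")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The second decode fails in the same way as your read_csv call, because the codec sees the start of an escape sequence and cannot complete it.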
Your encoding should probably be a different one. It's hard to say which without seeing your file, but most likely it should be utf_8, ascii or latin_1.
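If you are unsure which one applies, one possible approach is to try the candidates in order on the raw bytes. This is only a sketch: pick_encoding is a hypothetical helper, not part of pandas, and the sample text is one of the rows from your question encoded by hand. Note that latin_1 can decode any byte sequence, so it has to come last and a "successful" latin_1 decode does not prove the text is correct.

```python
def pick_encoding(raw: bytes, candidates=("utf_8", "latin_1")) -> str:
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Sample bytes standing in for the file contents (assumption: UTF-8 encoded).
sample = "Im „Solar Valley“ geht die Sonne unter.".encode("utf_8")
print(pick_encoding(sample))
```

With the real file you could read the bytes using open('de_full_1.tsv', 'rb').read(), pass them to the helper, and then hand the result to pd.read_csv via its encoding argument.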