I have an SQL Server database full of badly-encoded characters. The issue was that the ETL was written in Java with the wrong default encoding, so it loaded the badly-encoded characters directly. The database is literally holding the characters Ã¶ instead of ö.
I've fixed the issue with the ETL, now I would like to fix the existing data. I know I can use REPLACE() to replace the characters, but my issue is that I do not have a definitive list of characters and what they're "supposed" to be.
I have scoured the internet for some sort of definitive list but I cannot find one. I just need a big list of this:
Ã¶ > ö
Ã© > é
etc
EDIT: I first thought the "bad" encoding was ISO 8859-15, but I just checked the previous default charset and it is windows-1252. I have updated the title of the post.
It's a big list. UTF-8 encodes all of Unicode in one to four bytes per Unicode code point. There are 1,114,112 Unicode code points (1,112,064 if UTF-16 surrogates are discounted).
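As a quick illustration of that one-to-four-byte range (the characters here are just arbitrary examples):

>>> [len(ch.encode('UTF-8')) for ch in 'Aö€😀']
[1, 2, 3, 4]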
It is easier to read the strings from the database, encode them to bytes using Windows-1252, and decode those bytes with UTF-8. Example in Python, since it's what I have available to demo:
>>> s = 'Ã¶Ã©'
>>> s.encode('Windows-1252').decode('UTF-8')
'öé'
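To see why the round trip works: re-encoding the mojibake with Windows-1252 recovers the original UTF-8 byte sequence, which is exactly what encoding the intended text with UTF-8 produces:

>>> 'Ã¶Ã©'.encode('Windows-1252')
b'\xc3\xb6\xc3\xa9'
>>> 'öé'.encode('UTF-8')
b'\xc3\xb6\xc3\xa9'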
Note, though, that Windows-1252 is lossy. A few byte values used in UTF-8 have no definition in Windows-1252 (0x81, 0x8D, 0x8F, 0x90 and 0x9D), so it's possible that they were dropped when written to your database. It's also possible that ISO-8859-1 (aka latin1) was used, which is close to Windows-1252 but has all 256 byte values defined.
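If you want to apply the fix in bulk from Python rather than building a REPLACE() list, here is a minimal sketch using pyodbc. The connection string, table and column names (dbo.Customers, id, name) are hypothetical placeholders; rows that don't round-trip (already clean, or genuinely lossy) are left alone.

import pyodbc

# Hypothetical connection string and schema -- adjust to your environment.
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;'
                      'DATABASE=mydb;Trusted_Connection=yes;')
cur = conn.cursor()

cur.execute("SELECT id, name FROM dbo.Customers")
for row_id, value in cur.fetchall():
    if value is None:
        continue
    try:
        # Re-encode the mojibake with Windows-1252 to recover the original
        # UTF-8 bytes, then decode them properly.
        fixed = value.encode('Windows-1252').decode('UTF-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Value doesn't round-trip: it is probably already clean, so skip it.
        continue
    if fixed != value:
        cur.execute("UPDATE dbo.Customers SET name = ? WHERE id = ?", fixed, row_id)

conn.commit()

The try/except also doubles as a detector: any row that raises on the round trip is one you may want to inspect by hand, given the lossiness caveat above.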