Search code examples
sql-serverutf-8windows-1252

List of character encoding conversions required : windows-1252 to UTF-8


I have an SQL Server database full of badly-encoded characters. The issue was the ETL was written in Java with the wrong default encoding, and was therefore directly loading the badly-encoded characters. The database is literally holding the characters ö instead of ö

I've fixed the issue with the ETL, now I would like to fix the existing data. I know I can use REPLACE() to replace the characters, but my issue is that I do not have a definitive list of characters and what they're "supposed" to be.

I have scoured the internet for some sort of definitive list but I cannot find one. I just need a big list of this:

ö > ö

é > é

etc

EDIT: I first thought the "bad" encoding was ISO 8859-15, actually I just checked the previous default charset and it is windows-1252. I have updated the title of the post.


Solution

  • It's a big list. UTF-8 encodes all of Unicode in one to four bytes per Unicode code point. There are 1,114,111 Unicode code points (1,112,064 if UTF-16 surrogates are discounted).

    It is easier to read the strings using Windows-1252, encode them to bytes using Windows-1252, decode them with UTF-8. Example in Python since its what I have available to demo:

    >>> s = 'öé'
    >>> s.encode('Windows-1252').decode('UTF-8')
    'öé'
    

    Note though, that Windows-1252 is lossy. Some bytes used in UTF-8 don't have a definition in Windows-1252, so its possible that when written to your database they were dropped. It's also possible that ISO-8859-1 was used (aka latin1) which is close to Windows-1252 but all bytes are defined.