
Detecting a special character in C# and replacing it with its plain ASCII equivalent


I am reading a file in C# that sometimes contains special characters. As you can see in the screenshot below, the file contains the word Zürich, but instead of ü it shows some special character.

Is it possible to detect what that special character is and replace it with the equivalent English character, e.g. in this case replacing it with u?

[Screenshot: the word Zürich with a special character displayed in place of ü]


Solution

  • Is it possible to detect what that special character is

    Yes, but also no. You have to know in which encoding the file is written to get it right. The screenshot looks like Notepad++ (Scintilla), and Notepad++ has an encoding menu. Switch encodings until the characters look good.

    You cannot reasonably do this through code. Sure, you can guess and do things like character frequency analysis, but an ü is stored as a certain number (code point, byte value(s)), and that number represents an ü only in some code pages (encodings). In other encodings the same number stands for a different character.
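To make the ambiguity concrete, here is a minimal sketch that decodes one byte sequence under two encodings. The class and method names (`EncodingAmbiguityDemo`, `DecodeBothWays`) are made up for illustration, and UTF-8 and ISO-8859-1 are chosen only because .NET ships both without extra setup, not because either is the encoding of the file in the question.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class EncodingAmbiguityDemo
{
    // Decodes the same bytes under two encodings to show that a byte value
    // only means "ü" once the encoding is known.
    public static (string AsUtf8, string AsLatin1) DecodeBothWays(byte[] bytes)
    {
        string asUtf8 = Encoding.UTF8.GetString(bytes);
        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        return (asUtf8, asLatin1);
    }
}
```

For the bytes `5A C3 BC 72 69 63 68`, the UTF-8 result is "Zürich", while the ISO-8859-1 result is "ZÃ¼rich": same bytes, different text.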

    Obligatory: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (it's 20 years old already!).

    So if you know that the file is saved with a certain encoding (and it does not seem to be UTF-8, Windows-1252 or ISO-8859-1), then read it with that encoding; see "C# FileStream read set encoding".
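Once the encoding is known, reading the file could look roughly like the sketch below. `KnownEncodingReader` is a hypothetical name, and the encoding name you pass in is whatever you identified for your file; the example does not know which one that is.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class KnownEncodingReader
{
    // encodingName is the encoding you determined for the file,
    // for example "iso-8859-1" or "utf-8".
    // Note: on .NET Core / .NET 5+, legacy code pages such as Windows-1252
    // are only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    public static string ReadAllText(string path, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        return File.ReadAllText(path, encoding);
    }
}
```

A call such as `KnownEncodingReader.ReadAllText(path, "iso-8859-1")` then returns the text with ü intact, provided the guess about the encoding was right.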

    If, on the other hand, the file actually contains the character □ ('WHITE SQUARE' (U+25A1), which "may be used to represent a missing ideograph"), then the original information (the code point of ü) is lost.
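A quick way to check for that case in code is to look for U+25A1 in the decoded text. This is only a sketch with a made-up class name (`WhiteSquareCheck`), and it assumes the text was decoded with the correct encoding in the first place.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class WhiteSquareCheck
{
    // Returns the positions of every U+25A1 (WHITE SQUARE) in the text.
    // A non-empty result means the placeholder is really stored in the data.
    public static List<int> FindWhiteSquares(string text)
    {
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\u25A1')
                positions.Add(i);
        }
        return positions;
    }
}
```

If the list is non-empty, no choice of encoding will bring the ü back; the file has to be regenerated from its original source.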

    tl;dr: use your text editor's encoding features to guess the encoding, or use a hex editor to look up the numeric value of the character, then find the encoding that matches.
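If no hex editor is at hand, the same inspection can be sketched in C#. The class `NonAsciiByteInspector` and its methods are hypothetical names; the code only reports raw byte values and leaves the choice of encoding to you.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
public static class NonAsciiByteInspector
{
    // Returns one "offset: 0xNN" entry for every byte above 0x7F,
    // i.e. every byte that is not plain ASCII.
    public static List<string> Describe(byte[] bytes)
    {
        var result = new List<string>();
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] > 0x7F)
                result.Add($"{i}: 0x{bytes[i]:X2}");
        }
        return result;
    }

    // Convenience overload that reads the raw bytes of a file.
    public static List<string> DescribeFile(string path)
    {
        return Describe(File.ReadAllBytes(path));
    }
}
```

As a rough guide, a single `0xFC` where ü should be points to ISO-8859-1 or Windows-1252, the pair `0xC3 0xBC` points to UTF-8, and `0x81` points to an old DOS code page such as 437 or 850.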