c# character-encoding ascii diacritics utf

Neither ASCII nor UTF8 can decode French characters, what should I do?


I have the following function:

private void ReceivedData(byte[] data)
{
    string info = Encoding.ASCII.GetString(data);

When I use this on data containing an é character, that character is replaced by a question mark (?).

For your information, this is how the data looks in Visual Studio's Watch window (the character in question appears at data[27] and data[28]):

[image: Visual Studio Watch window showing the contents of data]

For your information: when I type ALT+0233 on my computer, I see the mentioned é character.

When I replace the ASCII encoding with the UTF8 encoding (as suggested on some websites and in some answers here on the site), I get weird characters containing question marks (��):

private void ReceivedData(byte[] data)
{
    string info = Encoding.UTF8.GetString(data);

Which encoding should I use to correctly decode French characters?

Thanks in advance


Solution

  • This looks like Windows-1252 encoding (a single-byte code page that covers Western European Latin characters with diacritics):

    using System;
    using System.Text;

    // On .NET Core you have to register the code pages provider before code page 1252 can be used
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    
    byte[] data = {
      95, 233, 233, 110
    };
    
    var result = Encoding.GetEncoding(1252).GetString(data);
    
    Console.Write(result);
    

    Output:

    _één
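
    To see why the two attempts in the question fail, here is a minimal sketch that decodes the same four bytes with all three encodings. The variable names are illustrative only; the bytes are the sample from the snippet above, not the asker's real data.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Same sample bytes as above: '_', 0xE9, 0xE9, 'n'.
byte[] sample = { 95, 233, 233, 110 };

// ASCII is a 7-bit encoding: every byte above 127 falls back to '?'.
string asAscii = Encoding.ASCII.GetString(sample);

// In UTF-8, 0xE9 announces a multi-byte sequence, but no continuation byte
// (0x80-0xBF) follows it here, so the default decoder substitutes U+FFFD.
string asUtf8 = Encoding.UTF8.GetString(sample);

// Windows-1252 is single-byte: 0xE9 maps directly to 'é'.
string as1252 = Encoding.GetEncoding(1252).GetString(sample);

Console.WriteLine($"ASCII        : {asAscii}");
Console.WriteLine($"UTF-8        : {asUtf8}");
Console.WriteLine($"Windows-1252 : {as1252}");
```

    Here asAscii is "_??n" and as1252 is "_één", while asUtf8 contains U+FFFD replacement characters where the é bytes were. This matches the two symptoms described in the question.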
    

    Edit: In the general case, when facing an unknown encoding, you can try querying all the available encodings and inspecting the results:

    using System;
    using System.Linq;
    using System.Text;
    
    ...
    
    // Enable code pages on .NET Core
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    
    byte[] data = {
      95, 233, 233, 110
    };
    
    var report = string.Join(Environment.NewLine, Encoding
      .GetEncodings()
      .OrderBy(encoder => encoder.Name, StringComparer.OrdinalIgnoreCase)
      .Select(encoder => (name: encoder.Name, text: encoder.GetEncoding().GetString(data)))
      .Where(pair => pair.text.Contains('é')) // at least one é must be present
      .Select(pair => $"{pair.name,-30} : {pair.text}"));
    
    Console.Write(report);
    

    Output:

    iso-8859-1                     : _één
    iso-8859-13                    : _één
    iso-8859-15                    : _één
    iso-8859-2                     : _één
    iso-8859-3                     : _één
    iso-8859-4                     : _één
    iso-8859-9                     : _één
    windows-1250                   : _één
    windows-1252                   : _één <- The most probable (IMHO) encoding
    windows-1254                   : _één
    windows-1256                   : _één
    windows-1257                   : _één
    windows-1258                   : _één
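
    If the sender might sometimes send real UTF-8, one common heuristic is to try a strict UTF-8 decode first and fall back to a single-byte code page only when that fails. The sketch below assumes the fallback is Windows-1252, following the guess above. DecodeText is a hypothetical helper name, not something from the original code.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: strict UTF-8 first, Windows-1252 as the fallback.
static string DecodeText(byte[] data)
{
    // throwOnInvalidBytes: true makes invalid UTF-8 raise an exception
    // instead of silently producing U+FFFD replacement characters.
    var strictUtf8 = new UTF8Encoding(
        encoderShouldEmitUTF8Identifier: false,
        throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(data);
    }
    catch (DecoderFallbackException)
    {
        // Assumption: anything that is not valid UTF-8 is Windows-1252.
        return Encoding.GetEncoding(1252).GetString(data);
    }
}

// The sample bytes from above (not valid UTF-8, so the fallback is used).
Console.WriteLine(DecodeText(new byte[] { 95, 233, 233, 110 }));

// The same 'é' encoded as valid UTF-8 (0xC3 0xA9) is decoded by the strict pass.
Console.WriteLine(DecodeText(new byte[] { 95, 0xC3, 0xA9, 110 }));
```

    This is only a heuristic: short Windows-1252 inputs can happen to be valid UTF-8, so if the sender's encoding is known, decoding with that encoding directly is the safer choice.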