.net, utf-8, utf-7

Intelligent UTF-8 to UTF-7 in .NET


I have a string containing UTF-8 characters that needs to be output to an older system as UTF-7. I have two questions about this:

  1. How can I efficiently convert a string s that contains UTF-8 characters into the same string without those characters?

  2. Are there any simple ways of converting extended characters like 'Ō' to their closest non-extended equivalent, 'O'?


Solution

  • If the older system can actually handle UTF-7 properly, why do you want to remove anything? Just encode the string as UTF-7:

    string text = LoadFromWherever(Encoding.UTF8);
    byte[] utf7 = Encoding.UTF7.GetBytes(text);
    

    Then send the UTF-7-encoded text down to the older system.

    If you've got the original UTF-8-encoded bytes, you can do this in one step:

    byte[] utf7 = Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8);
    

    If you actually need to convert to ASCII, you can do this reasonably easily.

    To remove the non-ASCII characters:

    // Empty replacement fallbacks drop any character that has no ASCII encoding.
    var encoding = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(""),
        new DecoderReplacementFallback(""));
    byte[] ascii = encoding.GetBytes(text);
    

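    To convert non-ASCII characters to their nearest ASCII equivalent, decompose them first so that the base letter survives and only the combining marks are dropped. Characters that have no decomposition (for example 'ø') are still removed: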

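    // FormKD splits 'Ō' into 'O' followed by a combining macron (U+0304).
    string normalized = text.Normalize(NormalizationForm.FormKD);
    // The empty fallbacks then drop the combining mark and keep the 'O'.
    var encoding = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(""),
        new DecoderReplacementFallback(""));
    byte[] ascii = encoding.GetBytes(normalized);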
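
The two ASCII approaches above can be combined into one small program to compare their output side by side. This is only a sketch: the class name `AsciiFolding`, the method `ToAscii`, and its `foldToNearest` parameter are hypothetical names introduced for illustration and are not part of the original answer. It also decodes the bytes back into a string purely so that the result can be printed.

```csharp
using System;
using System.Text;

static class AsciiFolding
{
    // Hypothetical helper (not from the original answer): returns the text
    // reduced to ASCII, optionally folding accented letters to their base letter.
    public static string ToAscii(string text, bool foldToNearest)
    {
        // FormKD decomposes e.g. 'Ō' (U+014C) into 'O' + combining macron (U+0304).
        string input = foldToNearest
            ? text.Normalize(NormalizationForm.FormKD)
            : text;

        // Empty replacement fallbacks silently drop anything ASCII cannot represent.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));

        byte[] bytes = ascii.GetBytes(input);
        return ascii.GetString(bytes);
    }

    static void Main()
    {
        string sample = "\u014Csaka"; // "Ōsaka", written with an escape to avoid source-encoding issues

        Console.WriteLine(ToAscii(sample, foldToNearest: false)); // prints "saka"
        Console.WriteLine(ToAscii(sample, foldToNearest: true));  // prints "Osaka"
    }
}
```

Without normalization the 'Ō' disappears entirely. With FormKD only the combining macron is dropped, so the base letter 'O' is kept.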