Tags: c#, encoding, ascii, fallback

Is there such a thing as a "user-defined encoding fallback"?


When encoding strings to bytes with the ASCII encoding, characters like ö turn into ?.

Encoding encoding = Encoding.GetEncoding("us-ascii");     // or: Encoding encoding = Encoding.ASCII;
byte[] data = encoding.GetBytes(s);

I'm looking for a way to replace those characters with something other than a question mark.
Examples:

ä -> ae
ö -> oe
ü -> ue
ß -> ss

If it's not possible to replace one character with several, I would settle for replacing each with a single character (ö -> o).

Now there are several implementations of EncoderFallback, but I don't understand how they work.
A quick and dirty solution would be to replace all those characters before passing the string to Encoding.GetBytes(), but that doesn't seem to be the "right" way.
I wish I could give a table of replacements to the encoding object.

How can I accomplish this?


Solution

  • The "most correct" way to achieve what you want is to implement a custom encoder fallback that does a best-fit fallback. The one built into .NET is, for various reasons, fairly conservative about which characters it will try to best-fit (there are security implications, depending on what you plan to do with the re-encoded string). Your custom fallback strategy can do best-fit based on whatever rules you want.

    Having said that - in your fallback class, you're going to end up writing a giant case statement of all the non-encode-able Unicode code points and manually mapping them to their best-fit alternatives. You can achieve the same goal by simply looping through your string ahead of time and swapping out the unsupported characters for replacements. The main benefit of the fallback strategy is performance: you only end up looping through your string once, instead of at least twice. Unless your strings are huge, though, I wouldn't worry too much about it.
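
    As a rough illustration of the "loop through the string ahead of time" approach, here is a minimal sketch. The class name PreReplace and its replacement table are made up for this example; only StringBuilder, Dictionary and Encoding.ASCII come from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PreReplace
{
    // Hypothetical replacement table; extend it with whatever mappings you need.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E4', "ae" }, // ä
        { '\u00F6', "oe" }, // ö
        { '\u00FC', "ue" }, // ü
        { '\u00DF', "ss" }, // ß
    };

    // First pass: swap every mapped character for its replacement string.
    public static string Transliterate(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (Map.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Second pass: the encoder walks the string again. Anything that is
    // still outside ASCII gets the encoding's own fallback treatment.
    public static byte[] ToAscii(string s)
    {
        return Encoding.ASCII.GetBytes(Transliterate(s));
    }
}
```

    This is the two-pass version: one loop in Transliterate, a second one inside GetBytes.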

    If you do want to implement a custom fallback strategy, you should definitely read the article in my comment: Character Encoding in the .NET Framework. It's not really hard, but you have to understand how the encoding fallback works.

    You pass an instance of your custom class, which has to derive from EncoderFallback, to the Encoding.GetEncoding method. That class, though, is basically just a wrapper around the real work, which is done in EncoderFallbackBuffer. You need a buffer because fallback is not necessarily a one-to-one process; in your example, you may end up mapping a single Unicode character to two ASCII characters.
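
    To show where the fallback object gets plugged in, here is a small sketch that uses the built-in EncoderReplacementFallback as a stand-in for a custom class. FallbackWiring and EncodeWithStar are names invented for this example; any EncoderFallback subclass could take the place of the second argument.

```csharp
using System;
using System.Text;

public static class FallbackWiring
{
    public static string EncodeWithStar(string s)
    {
        // The second argument is the encoder fallback. The built-in
        // EncoderReplacementFallback is used here only as a placeholder
        // for a custom EncoderFallback subclass.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("*"),
            DecoderFallback.ReplacementFallback);

        byte[] data = ascii.GetBytes(s);

        // Decode again just to make the result easy to inspect.
        return Encoding.ASCII.GetString(data);
    }
}
```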

    At the point where the encoding process first runs into a problem and needs to fall back on your strategy, it uses your EncoderFallback implementation to create an instance of your EncoderFallbackBuffer. It then calls the Fallback method of your custom buffer.

    Internally, your buffer builds up the sequence of characters to be returned in place of the non-encode-able one, and Fallback returns true. From there, the encoder calls GetNextChar repeatedly for as long as Remaining > 0, or until GetNextChar returns '\u0000' (code point 0), and puts those characters into the encoded result.
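
    The call sequence described above can be imitated by hand, which may help when debugging a custom buffer. The sketch below is a simplified stand-in for what an encoder does, not the framework's actual code; FallbackProtocol and Drain are hypothetical names.

```csharp
using System;
using System.Text;

public static class FallbackProtocol
{
    // Drives a fallback buffer the way the paragraph above describes:
    // one Fallback call, then GetNextChar until the buffer runs dry.
    public static string Drain(EncoderFallback fallback, char unknown)
    {
        EncoderFallbackBuffer buffer = fallback.CreateFallbackBuffer();
        var result = new StringBuilder();

        // index 0 is arbitrary here; a real encoder passes the position
        // of the offending character in the input.
        if (buffer.Fallback(unknown, 0))
        {
            while (buffer.Remaining > 0)
            {
                char next = buffer.GetNextChar();
                if (next == '\u0000') break; // buffer signals "nothing left"
                result.Append(next);
            }
        }
        return result.ToString();
    }
}
```

    For example, calling Drain with new EncoderReplacementFallback("oe") and the character ö collects the two replacement characters o and e.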

    The article includes an implementation of pretty much exactly what you're trying to do; I've copied out the basic framework below, which should get you started.

    public class CustomMapper : EncoderFallback
    {
       // Users can override the "replacement character", so track what they
       // give us.
       public string DefaultString;
    
       public CustomMapper() : this("*")
       {   
       }
    
       public CustomMapper(string defaultString)
       {
          this.DefaultString = defaultString;
       }
    
       public override EncoderFallbackBuffer CreateFallbackBuffer()
       {
          return new CustomMapperFallbackBuffer(this);
       }
    
       // This is the length of the largest possible replacement string we can
       // return for a single Unicode code point.
       public override int MaxCharCount
       {
          get { return 2; }
       } 
    }
    
    public class CustomMapperFallbackBuffer : EncoderFallbackBuffer
    {
       CustomMapper fb;
       string replacement = "";   // Replacement characters for the current fallback
       int position = 0;          // Index of the next character GetNextChar returns
    
       public CustomMapperFallbackBuffer(CustomMapper fallback)
       {
          // We can use the same custom buffer with different fallbacks, e.g.
          // we might have different sets of replacement characters for different
          // cases. This is just a reference to the parent in case we want it.
          this.fb = fallback;
       }
    
       public override bool Fallback(char charUnknown, int index)
       {
          // Figure out what sequence of characters should replace charUnknown.
          // index is the position of this character in the original string,
          // in case that's relevant.
          switch (charUnknown)
          {
             case '\u00E4': replacement = "ae"; break;   // ä
             case '\u00F6': replacement = "oe"; break;   // ö
             case '\u00FC': replacement = "ue"; break;   // ü
             case '\u00DF': replacement = "ss"; break;   // ß
             // No mapping: instead of returning false, use the DefaultString
             // from this.fb for failure cases.
             default: replacement = fb.DefaultString ?? ""; break;
          }
          position = 0;
    
          // If we generated a sequence of replacement characters, return true
          // and the encoder will start calling GetNextChar. Otherwise return
          // false.
          return replacement.Length > 0;
       }
    
       public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
       {
          // Same as above, except we have a UTF-16 surrogate pair. Same rules
          // apply: if we can map this pair, return true, otherwise return false.
          // For an ASCII-type encoding there is nothing sensible to map to,
          // so this sketch always returns false.
          replacement = "";
          position = 0;
          return false;
       }
    
       public override char GetNextChar()
       {
          // Return the next character in our internal buffer of replacement
          // characters waiting to be put into the encoded byte stream. If
          // we're all out of characters, return '\u0000'.
          return position < replacement.Length ? replacement[position++] : '\u0000';
       }
    
       public override bool MovePrevious()
       {
          // Back up to the previous character we returned and get ready
          // to return it again. If that's possible, return true; if that's
          // not possible (e.g. we have no previous character) return false.
          if (position == 0) return false;
          position--;
          return true;
       }
    
       public override int Remaining 
       {
          // Return the number of characters that we've got waiting
          // for the encoder to read.
          get { return replacement.Length - position; }
       }
    
       public override void Reset()
       {
          // Reset our internal state back to the initial one.
          replacement = "";
          position = 0;
       }
    }
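
    The question asks for a way to hand a table of replacements to the encoding object. Below is one possible sketch of that idea, end to end. TableFallback, TableBuffer and TableFallbackDemo are names invented for this example and are not part of .NET; the table contents and the "?" default are assumptions as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical table-driven fallback: the caller supplies the mappings.
public sealed class TableFallback : EncoderFallback
{
    private readonly IReadOnlyDictionary<char, string> table;
    private readonly string defaultString;
    private readonly int maxLength;

    public TableFallback(IReadOnlyDictionary<char, string> table, string defaultString = "?")
    {
        this.table = table;
        this.defaultString = defaultString;
        // MaxCharCount must cover the longest replacement we can return.
        maxLength = defaultString.Length;
        foreach (string value in table.Values)
            maxLength = Math.Max(maxLength, value.Length);
    }

    public override int MaxCharCount { get { return maxLength; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new TableBuffer(this);
    }

    private sealed class TableBuffer : EncoderFallbackBuffer
    {
        private readonly TableFallback owner;
        private string replacement = "";
        private int position;

        public TableBuffer(TableFallback owner) { this.owner = owner; }

        public override bool Fallback(char charUnknown, int index)
        {
            // Look the character up; unmapped characters get the default string.
            if (!owner.table.TryGetValue(charUnknown, out replacement))
                replacement = owner.defaultString;
            position = 0;
            return replacement.Length > 0;
        }

        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            // Surrogate pairs are not in the table; use the default string.
            replacement = owner.defaultString;
            position = 0;
            return replacement.Length > 0;
        }

        public override char GetNextChar()
        {
            return position < replacement.Length ? replacement[position++] : '\u0000';
        }

        public override bool MovePrevious()
        {
            if (position == 0) return false;
            position--;
            return true;
        }

        public override int Remaining { get { return replacement.Length - position; } }

        public override void Reset() { replacement = ""; position = 0; }
    }
}

public static class TableFallbackDemo
{
    public static string ToAscii(string s)
    {
        var table = new Dictionary<char, string>
        {
            { '\u00E4', "ae" }, // ä
            { '\u00F6', "oe" }, // ö
            { '\u00FC', "ue" }, // ü
            { '\u00DF', "ss" }, // ß
        };
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new TableFallback(table), DecoderFallback.ReplacementFallback);

        // Decode the bytes again only so the result is easy to read.
        return Encoding.ASCII.GetString(ascii.GetBytes(s));
    }
}
```

    Every replacement string in the table has to be encodable by the target encoding itself; if a replacement contains another non-ASCII character, the encoder is expected to reject it as a recursive fallback rather than loop.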