Ideographic space encoding


I want to exclude the ideographic space (U+3000) from our HTML encoding, but it's not working.

string a = "A\u3000B"; // U+3000 IDEOGRAPHIC SPACE between A and B
var encoder = HtmlEncoder.Create(allowedRanges: new[] { UnicodeRanges.BasicLatin, new UnicodeRange(3000, 1) });

Console.WriteLine(encoder.Encode(a));

Output is

A&#x3000;B

I expect it to be output as the space character itself. The reason is that I'm sending the result to another application, and they want to receive the character as-is:

A B

Solution

  • First, IDEOGRAPHIC SPACE's code point is U+3000, which is 3000 in hexadecimal, not decimal, so you should have written:

    new UnicodeRange(0x3000, 1)
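
    As a quick check of the decimal-versus-hex point, here is a minimal sketch. CodePointCheck and ToUPlus are hypothetical names introduced only for this illustration; they are not part of any library.

    ```csharp
    using System;

    public static class CodePointCheck
    {
        // Hypothetical helper: formats an integer code point in U+XXXX notation.
        public static string ToUPlus(int codePoint) => $"U+{codePoint:X4}";

        public static void Main()
        {
            Console.WriteLine(ToUPlus(3000));   // decimal literal, prints U+0BB8
            Console.WriteLine(ToUPlus(0x3000)); // hex literal, prints U+3000
        }
    }
    ```

    So new UnicodeRange(3000, 1) allows U+0BB8, a code point in the Tamil block, rather than the ideographic space.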
    

    However, this won't fix the problem.

    If you look at the "Remarks" of the documentation for Create, you'll see:

    Some characters in allowedRanges might still be encoded; that is, this parameter indicates what ranges the encoder is allowed to not encode, not what characters it must not encode.

    That sucks, doesn't it?

    If we have a look at the reference source, we see that there is a comment specifically saying to forbid all characters in certain categories (this constructor is called by Create):

    public DefaultHtmlEncoder(TextEncoderSettings settings)
    {
        if (settings == null)
        {
            throw new ArgumentNullException(nameof(settings));
        }

        _allowedCharacters = settings.GetAllowedCharacters();

        // Forbid codepoints which aren't mapped to characters or which are otherwise always disallowed
        // (includes categories Cc, Cs, Co, Cn, Zs [except U+0020 SPACE], Zl, Zp)
        _allowedCharacters.ForbidUndefinedCharacters();
    

    We see that all the characters in category Zs except the U+0020 space are forbidden. Since this is hard-coded in the constructor, and runs after the line _allowedCharacters = settings.GetAllowedCharacters();, you can't change the behaviour no matter how you change the settings.
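
    You can confirm that the ideographic space falls into category Zs with the standard char.GetUnicodeCategory call. This is only a small sketch; SpaceCategoryCheck and IsSpaceSeparator are hypothetical names used for illustration.

    ```csharp
    using System;
    using System.Globalization;

    public static class SpaceCategoryCheck
    {
        // True when the character is in Unicode general category Zs (SpaceSeparator).
        public static bool IsSpaceSeparator(char c) =>
            char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;

        public static void Main()
        {
            Console.WriteLine(IsSpaceSeparator('\u0020')); // U+0020 SPACE: True
            Console.WriteLine(IsSpaceSeparator('\u3000')); // U+3000 IDEOGRAPHIC SPACE: True
            Console.WriteLine(IsSpaceSeparator('A'));      // a letter: False
        }
    }
    ```

    Both spaces are Zs, but only U+0020 is exempted by the comment quoted above, which is why U+3000 is always encoded.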

    So in conclusion, you can't use HtmlEncoder to do this. You'd have to use something else.


    The old WebUtility.HtmlEncode seems to not encode ideographic space, but it also doesn't encode other spaces... Maybe that is useful to you?
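
    If you still want HtmlEncoder for everything else, one possible workaround (my own suggestion, not something the library offers) is to post-process its output and turn the entity for U+3000 back into the raw character. The sketch below assumes the encoder emits the ideographic space as &#x3000;. IdeographicSpaceWorkaround and EncodeKeepingIdeographicSpace are hypothetical names.

    ```csharp
    using System;
    using System.Net;
    using System.Text.Encodings.Web;
    using System.Text.Unicode;

    public static class IdeographicSpaceWorkaround
    {
        private static readonly HtmlEncoder Encoder =
            HtmlEncoder.Create(UnicodeRanges.BasicLatin);

        // Hypothetical helper: encode as usual, then restore the raw U+3000.
        // Assumption: the encoder writes U+3000 as "&#x3000;". A literal
        // "&#x3000;" in the input is safe, because its '&' becomes "&amp;".
        public static string EncodeKeepingIdeographicSpace(string input)
        {
            string encoded = Encoder.Encode(input);
            return encoded.Replace("&#x3000;", "\u3000", StringComparison.Ordinal);
        }

        public static void Main()
        {
            string a = "A\u3000B";

            // Option 1: the older API, which leaves U+3000 alone.
            Console.WriteLine(WebUtility.HtmlEncode(a));

            // Option 2: HtmlEncoder plus the post-processing step.
            Console.WriteLine(EncodeKeepingIdeographicSpace(a));
        }
    }
    ```

    Whether the raw character shows up correctly in a console depends on the console's output encoding, so inspect the string itself if the printed result looks wrong.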