
Is String.Replace(string, string) Unicode-safe with regard to surrogate pairs?


I am trying to figure out the best way to write a function equivalent to String.Replace(oldValue, newValue) that handles surrogate pairs correctly.

My concern is that if the string contains surrogate pairs and the search string can match just one half of a pair, the replacement would split the pair and corrupt the data.

So my high-level question is: is String.Replace(string oldValue, string newValue) a safe operation when it comes to Unicode and surrogate pairs?

If not, what would be the best path forward? I am familiar with the StringInfo class, which can split a string into text elements. I'm just unsure how to do the replacement when the old and new values are passed in as strings.

Thanks for the help!


Solution

  • It's safe, because strings in .NET are UTF-16 internally. A Unicode code point is represented by one or two UTF-16 code units, and a .NET char is one such code unit.

    When a code point is represented by two units, the first unit is called the high surrogate and the second the low surrogate. What matters for this question is that surrogate units belong to a specific range, U+D800 to U+DFFF (high surrogates U+D800 to U+DBFF, low surrogates U+DC00 to U+DFFF). This range is used only to represent surrogate pairs; a single unit from this range on its own has no meaning and is invalid.

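    String.Replace(string, string) performs an ordinal search, so it compares code units. Because a valid UTF-16 string never contains a surrogate unit outside a complete pair, a valid UTF-16 string cannot match just "part" of a surrogate pair in another valid UTF-16 string.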

    Note that a .NET string can also hold an invalid UTF-16 sequence. If any argument to Replace is invalid, it can indeed split a surrogate pair. That is garbage in, garbage out, though, so I don't consider it a problem in this case.
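
To make the argument above concrete, here is a minimal sketch rather than a definitive implementation. The names IsWellFormedUtf16 and SafeReplace are hypothetical helpers invented for this illustration; they are not part of .NET. The sketch assumes you want to reject an oldValue or newValue that contains an unpaired surrogate, and it leaves the input string itself unchecked.

```csharp
using System;

static class SurrogateReplaceDemo
{
    // Hypothetical helper: returns true when every surrogate code unit in s
    // belongs to a complete high + low pair.
    public static bool IsWellFormedUtf16(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c))
            {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]))
                    return false;
                i++; // skip the low half of the pair
            }
            else if (char.IsLowSurrogate(c))
            {
                // A low surrogate without a preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    // Hypothetical wrapper: refuses ill-formed search or replacement values,
    // then defers to the built-in ordinal String.Replace.
    public static string SafeReplace(string input, string oldValue, string newValue)
    {
        if (!IsWellFormedUtf16(oldValue))
            throw new ArgumentException("Contains an unpaired surrogate.", nameof(oldValue));
        if (!IsWellFormedUtf16(newValue))
            throw new ArgumentException("Contains an unpaired surrogate.", nameof(newValue));
        return input.Replace(oldValue, newValue);
    }

    static void Main()
    {
        // U+1F600 is stored as the surrogate pair D83D DE00.
        string text = "a\U0001F600b";

        // Well-formed arguments: the pair is replaced as a whole.
        string ok = SafeReplace(text, "\U0001F600", "x");
        Console.WriteLine(IsWellFormedUtf16(ok));     // True

        // A lone high surrogate as oldValue splits the pair and leaves
        // an orphaned low surrogate behind.
        string broken = text.Replace("\uD83D", "?");
        Console.WriteLine(IsWellFormedUtf16(broken)); // False
    }
}
```

The second call is the "garbage in, garbage out" case: only an ill-formed argument can cut a pair in half, so validating the arguments once up front is enough and no StringInfo-based rewrite of Replace is needed.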