I am trying to normalize a string (using .net standard 2.0) using Form D, and it works perfectly and running on a Windows machine.
[TestMethod]
public void TestChars()
{
var original = "é";
var normalized = original.Normalize(NormalizationForm.FormD);
var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
Assert.AreEqual("233,0", originalBytesCsv);
var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
}
When I run this on Linux, it returns "253,255" for both strings, before and after normalization. These two bytes form the word 65533 which is the Unicode Replacement char, used when something goes wrong with encoding. That's the part where I am lost.
What am I missing here? Is there someone to point me in the right direction?
It might be related to the encoding of the source file. I'm not sure which encodings .net on Linux supports, but to be on the safe side, you should use plain ASCII source files and Unicode escapes for Non-ASCII characters:
var original = "\u00e9";