c# .net string unicode string-comparison

How to compare Unicode characters that "look alike"?


I ran into a surprising issue.

I loaded a text file in my application, and I have some logic that compares values containing µ.

I realized that even when the texts look the same, the comparison returns false.

 Console.WriteLine("μ".Equals("µ")); // prints False
 Console.WriteLine("µ".Equals("µ")); // prints True

In the second line, the character µ is copy-pasted on both sides.

However, these might not be the only characters that are like this.

Is there any way in C# to compare characters that look the same but are actually different?


Solution

  • In many cases, you can normalize both strings to the same Unicode normalization form before comparing them, and they should then match. Which normalization form you need depends on the characters themselves; just because two characters look alike doesn't necessarily mean they represent the same character. You also need to consider whether normalization is appropriate for your use case; see Jukka K. Korpela's comment.

    For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

    Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

    This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

    So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

    using System;
    using System.Text;
    
    class Program
    {
        static void Main(string[] args)
        {
            char first = 'μ';
            char second = 'µ';
    
            // Technically you only need to normalize U+00B5 to obtain U+03BC, but
            // if you're unsure which character is which, you can safely normalize both
            string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
            string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
    
            Console.WriteLine(first.Equals(second));                     // False
            Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
        }
    }
    

    For details on Unicode normalization and the different normalization forms, refer to System.Text.NormalizationForm and the Unicode spec.
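
If you need this for whole strings rather than single characters, the same idea could be wrapped in a small helper. The following is only a sketch: the class `LookAlike` and its methods `EqualsCompat` and `CodeUnits` are hypothetical names made up for this example, not part of .NET. It uses `\u` escapes so the result doesn't depend on how the source file is encoded, and it also shows a pair that still won't match after normalization (Latin A versus Cyrillic А), which illustrates the point above that looking alike doesn't imply being the same character.

```csharp
using System;
using System.Text;

static class LookAlike
{
    // Hypothetical helper: ordinal comparison after compatibility
    // normalization (NFKC) of both inputs.
    public static bool EqualsCompat(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    // Hypothetical helper: lists the UTF-16 code units of a string,
    // which is handy for spotting which "look-alike" you actually have.
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}

class Demo
{
    static void Main()
    {
        string greekMu = "\u03BC";   // GREEK SMALL LETTER MU
        string microSign = "\u00B5"; // MICRO SIGN
        string latinA = "A";         // LATIN CAPITAL LETTER A (U+0041)
        string cyrillicA = "\u0410"; // CYRILLIC CAPITAL LETTER A

        Console.WriteLine(LookAlike.CodeUnits(microSign));             // U+00B5
        Console.WriteLine(LookAlike.EqualsCompat(greekMu, microSign)); // True
        Console.WriteLine(LookAlike.EqualsCompat(latinA, cyrillicA));  // False
    }
}
```

Keep in mind that compatibility normalization changes more than this one character (for example, it also folds ligatures and full-width forms), so it is better suited to comparison keys than to text you intend to store or display.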