javascript, string, string-comparison

localeCompare returns 0 for different Unicode symbols


I want to use localeCompare to sort strings in a strict order, but it returns 0 when given two different Unicode characters, which wrongly indicates that they are the same. For example:

ℜ U+211C (alt-08476) BLACK-LETTER CAPITAL R = real part

ℝ U+211D (alt-08477) DOUBLE-STRUCK CAPITAL R = the set of real numbers

"ℜ".localeCompare("ℝ", "en")   
> 0

"ℜ" === "ℝ"                    
> false

"ℜ".charCodeAt(0)
> 8476

"ℝ".charCodeAt(0)
> 8477

I have looked at the docs, but the defaults are already usage "sort" and sensitivity "variant", which appear to be the strictest settings available:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/Collator

Is localeCompare unable to give a strict ordering?


Solution

  • It seems that String.prototype.localeCompare() recognizes both characters as non-ASCII variants of the uppercase letter R and therefore reports no ordering distinction between them. The same happens with variants of other letters:

    console.log(
      // two non-0x43 uppercase Cs
      'ℂ'.localeCompare('𝑪', 'en'),
    
      // two non-0x5A uppercase Zs
      "ℤ".localeCompare('𝗭', 'en'),
      
      // 0x5A ASCII Z precedes both:
      "Z".localeCompare('ℤ', 'en'),
      "Z".localeCompare('𝗭', 'en'),
    );
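
A quick way to check that these symbols are variants of ordinary letters is compatibility normalization. The sketch below is an illustration rather than part of the original answer: the names symbols, folded, and canonical are made up for the example, and it assumes a runtime that supports String.prototype.normalize (any modern browser or Node.js).

```javascript
// The six symbols used in the comparisons above.
const symbols = ['ℜ', 'ℝ', 'ℂ', '𝑪', 'ℤ', '𝗭'];

// NFKC applies compatibility decomposition, then canonical composition.
const folded = symbols.map((s) => s.normalize('NFKC'));

// NFC applies only canonical mappings.
const canonical = symbols.map((s) => s.normalize('NFC'));

console.log(folded.join(' '));    // R R C C Z Z
console.log(canonical.join(' ')); // ℜ ℝ ℂ 𝑪 ℤ 𝗭 (unchanged)
```

NFKC folds each symbol to its base letter while NFC leaves it unchanged, which suggests the collator is ranking font variants of the same letter equally.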

    You can fall back to code unit order wherever localeCompare reports no order between two strings (that is, returns 0):

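    // Locale order first; on a tie (0), fall back to UTF-16 code unit order.
    // The fallback must return 1 as well as -1, otherwise a > b still yields 0.
    const sort = (a, b) => a.localeCompare(b) || (a < b ? -1 : a > b ? 1 : 0);
    
    console.log(
      //  1 (C < 𝑪 in localeCompare)
      sort('𝑪', 'C'),
      // -1 (localeCompare ties; falls back to 0x2102 < 0xD835)
      sort('ℂ', '𝑪')
    );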
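
As a usage sketch, a comparator of this kind can be passed straight to Array.prototype.sort. The helper name strictCompare, the locale default, and the sample input are assumptions made for this illustration; the fallback compares UTF-16 code units, which is not the same as code point order for characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical helper: locale order first, code unit order as a tie-breaker.
const strictCompare = (a, b, locale = 'en') =>
  a.localeCompare(b, locale) || (a < b ? -1 : a > b ? 1 : 0);

// Sample input containing both symbols from the question, plus a duplicate.
const input = ['ℝ', 'Z', 'ℜ', 'R', 'ℜ'];

// Copy before sorting so the original array is left untouched.
const sorted = [...input].sort((a, b) => strictCompare(a, b));

console.log(sorted);

// Only identical strings compare as equal now.
console.log(strictCompare('ℜ', 'ℝ') < 0); // true: U+211C sorts before U+211D
console.log(strictCompare('ℜ', 'ℜ'));     // 0
```

Because the tie-breaker returns 0 only for identical strings, two different strings never compare as equal, which is the strict ordering the question asks for.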

    From the ECMAScript spec:

    The actual return values are implementation-defined to permit implementers to encode additional information in the value, but the function is required to define a total ordering on all Strings and to return 0 when comparing Strings that are considered canonically equivalent by the Unicode standard.

    From the Wikipedia article on Unicode equivalence:

    Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.

    For example, the code point U+006E (the Latin lowercase n) followed by U+0303 (the combining tilde ◌̃) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter ñ of the Spanish alphabet).

    Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
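
The two cases in the quoted passage can be checked directly. This is a minimal sketch, not part of the quoted sources: the variable names are illustrative, the Hangul syllable 한 (U+D55C) is just one convenient example, and the localeCompare result assumes a runtime with Intl support that follows the spec text quoted above.

```javascript
const precomposed = '\u00F1'; // ñ as a single code point
const decomposed = 'n\u0303'; // n followed by the combining tilde

// Different code unit sequences, so strict equality fails.
console.log(precomposed === decomposed); // false

// Canonical decomposition (NFD) maps both to the same sequence.
console.log(precomposed.normalize('NFD') === decomposed); // true

// Canonically equivalent strings are required to compare as equal.
console.log(precomposed.localeCompare(decomposed, 'en')); // 0

// A precomposed Hangul syllable decomposes into its conjoining jamo.
const syllable = '\uD55C'; // 한
const jamo = syllable.normalize('NFD');
console.log(jamo.length); // 3 (leading consonant, vowel, trailing consonant)
```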

    [Image: Unicode equivalence example]

    See also the Unicode Collation Algorithm (UTS #10): https://unicode.org/reports/tr10/