Unicode composition in JavaScript

I am looking for a way to count ligatures as single units, the way they are displayed to the user, e.g.

When this character is typed (the G key on an Arabic keyboard), it is inserted in decomposed form, i.e. U+0644 U+0627.

I'm able to decompose U+FEFB by

escape(String.fromCodePoint(0xFEFB).normalize("NFKD")) // '%u0644%u0627'

Is there a way to compose U+0644 U+0627 into U+FEFB?

Why doesn't this work?

escape(String.fromCodePoint(0x0644, 0x0627).normalize("NFKC")) // '%u0644%u0627', not '%uFEFB'

The only idea I had was to iterate over the Unicode ranges I'm interested in, decompose each character, and build a reverse map, but I'm hoping there's a better way.
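
For reference, the map idea could look roughly like this. It is only a sketch: the range U+FEF5 to U+FEFC (the lam-alef ligatures in Arabic Presentation Forms-B) and the helper names `buildCompositionMap` and `composeLigatures` are assumptions chosen for illustration, not part of any standard API.

```javascript
// Build a reverse lookup: NFKD decomposition -> presentation-form character.
// Assumption: only the lam-alef ligatures U+FEF5..U+FEFC are of interest.
function buildCompositionMap(start = 0xFEF5, end = 0xFEFC) {
  const map = new Map();
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    const decomposed = ch.normalize("NFKD");
    // Skip code points that have no decomposition.
    if (decomposed === ch) continue;
    // Isolated and final forms share one decomposition; keep the first seen
    // (the isolated form, since it has the lower code point).
    if (!map.has(decomposed)) map.set(decomposed, ch);
  }
  return map;
}

// Replace every decomposed sequence found in the map with its ligature.
// Longer keys are handled first so that e.g. U+0644 U+0627 U+0653 is not
// partially consumed by the shorter U+0644 U+0627 key.
function composeLigatures(text, map) {
  const keys = [...map.keys()].sort((a, b) => b.length - a.length);
  let out = text;
  for (const key of keys) {
    out = out.split(key).join(map.get(key));
  }
  return out;
}

const ligatureMap = buildCompositionMap();
const composed = composeLigatures("\u0644\u0627", ligatureMap);
console.log(escape(composed));      // escaped form of the composed string
console.log([...composed].length);  // number of code points after composing
```

Note that this always picks the isolated form, whereas a renderer chooses the isolated or final glyph from context, so it is only suitable for counting, not for reproducing what is actually displayed.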


  • Given that the ES2019 spec requires the implementation to:

    Let ns be the String value that is the result of normalizing S into the normalization form named by f as specified in Unicode Standard Annex #15 (Unicode Normalization Forms)

    and given that the Unicode normalization test data (NormalizationTest.txt) describes that character as

    FEFB;FEFB;FEFB;0644 0627;0644 0627; # (ﻻ; ﻻ; ﻻ; لا; لا; ) ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

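    it is the compliant behaviour. See the conformance invariants in the header of NormalizationTest.txt, where c1 to c5 are the five columns of each data row: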

    # 1. The following invariants must be true for all conformant implementations
    #    NFC
    #      c2 ==  toNFC(c1) ==  toNFC(c2) ==  toNFC(c3)
    #      c4 ==  toNFC(c4) ==  toNFC(c5)
    #    NFD
    #      c3 ==  toNFD(c1) ==  toNFD(c2) ==  toNFD(c3)
    #      c5 ==  toNFD(c4) ==  toNFD(c5)
    #    NFKC
    #      c4 == toNFKC(c1) == toNFKC(c2) == toNFKC(c3) == toNFKC(c4) == toNFKC(c5)
    #    NFKD
    #      c5 == toNFKD(c1) == toNFKD(c2) == toNFKD(c3) == toNFKD(c4) == toNFKD(c5)

    No normalisation form converts c4 or c5 back to c1, c2, or c3.

    So, in my Unicode-amateur opinion, there is no standard-compliant way to normalise U+0644 U+0627 back to U+FEFB.
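
The invariants quoted above can be checked directly in a JavaScript engine. This is a small sketch, not a conformance test: the variable names `c1`, `c4`, `toHex`, and `results` are made up for illustration, and it assumes an engine whose `String.prototype.normalize` has full Unicode data available.

```javascript
// Values taken from the data row for U+FEFB quoted in the answer.
const c1 = "\uFEFB";        // source: ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
const c4 = "\u0644\u0627";  // its compatibility decomposition: LAM + ALEF

// Render a string as space-separated U+XXXX code points.
const toHex = (s) =>
  [...s]
    .map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");

// Apply all four normalization forms to both the ligature and the sequence.
const results = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  results[form] = {
    fromLigature: toHex(c1.normalize(form)),
    fromSequence: toHex(c4.normalize(form)),
  };
}
console.log(results);
```

If the engine follows the test data, the ligature survives NFC and NFD unchanged and is decomposed by NFKC and NFKD, while the two-character sequence comes back unchanged under every form, so none of the four calls yields U+FEFB from U+0644 U+0627.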