After many attempts to understand it, I have to say I still don't get how String.prototype.normalize() works. This method can take one of a few values as a parameter: NFC, NFD, NFKC, or NFKD.
Firstly, I don't get the difference between NFD and NFKD. The spec is very vague about it, so... In one resource I read that NFD decomposes characters by canonical equivalence. For example:
"â" (U+00E2) -> "a" (U+0061) + " ̂" (U+0302)
And NFKD decomposes characters by compatibility. For example:
"fi" (U+FB01) -> "f" (U+0066) + "i" (U+0069)
But that's not exactly true: NFKD doesn't only decompose characters by compatibility. It handles the first example perfectly well too:
let s = `\u00E2`; //"â"
console.log(s.normalize('NFD').length); //2
console.log(s.normalize('NFKD').length); //2
Does that mean NFKD can decompose characters both by compatibility and by canonical equivalence, while NFD decomposes only by canonical equivalence?
let s = `\uFB01`; //"fi"
console.log(s.normalize('NFD').length); //1 -- the ligature is left intact
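Mixing both kinds of character in one string (my own example) makes the difference visible side by side:

```javascript
// "âfi": U+00E2 has a canonical decomposition, U+FB01 only a compatibility one
let mixed = '\u00E2\uFB01';
console.log(mixed.normalize('NFD').length);  //3 -- "â" decomposes, "fi" stays
console.log(mixed.normalize('NFKD').length); //4 -- both decompose
```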
The type of full decomposition chosen depends on which Unicode Normalization Form is involved. For NFC or NFD, one does a full canonical decomposition, which makes use of only canonical Decomposition_Mapping values. For NFKC or NFKD, one does a full compatibility decomposition, which makes use of canonical and compatibility Decomposition_Mapping values.
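Concretely (assuming I'm reading that passage right), a full compatibility decomposition applies canonical and compatibility mappings recursively. U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE) shows both kinds in one character:

```javascript
// U+1E9B has a *canonical* mapping to U+017F (long s) + U+0307 (combining dot above);
// U+017F in turn has a *compatibility* mapping to plain "s" (U+0073)
let s = '\u1E9B';
console.log(s.normalize('NFD'));  //"\u017F\u0307" -- canonical mapping only
console.log(s.normalize('NFKD')); //"s\u0307" -- canonical, then compatibility
```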
That's why NFC/NFD and NFKC/NFKD behave like this:
let s1 = '\uFB00';       //"ff" ligature (one code point)
let s2 = '\u0066\u0066'; //"ff" (two code points)
console.log(s1.normalize('NFD').length);  //1 -- NFD uses only canonical mappings, not compatibility ones
console.log(s2.normalize('NFKC').length); //2 -- NFKC never composes into compatibility characters

let t1 = '\u00F4';       //"ô" (precomposed)
let t2 = '\u006F\u0302'; //"ô" (decomposed)
console.log(t1.normalize('NFKD').length); //2 -- NFKD also applies canonical decompositions
console.log(t2.normalize('NFKC').length); //1 -- NFKC also applies canonical compositions
And that's completely understandable, because...
All canonically equivalent sequences are also compatible, but not vice versa.
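A practical consequence of that rule (my own sketch): canonically equivalent strings compare equal after any of the four normalizations, while merely compatibility-equivalent strings only collapse under the K forms.

```javascript
let composed = '\u00E2';    //"â" precomposed
let decomposed = 'a\u0302'; //"â" decomposed -- canonically equivalent to the above
let ligature = '\uFB01';    //"fi" -- only compatibility-equivalent to "fi"

for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  console.log(form,
    composed.normalize(form) === decomposed.normalize(form), //true for all four forms
    ligature.normalize(form) === 'fi'.normalize(form));      //true only for NFKC/NFKD
}
```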