I got two different "versions" of Arabic letters on Wikipedia. The first example seems to be 3 sub-components in one:
"ـمـ".split('').map(x => x.codePointAt(0).toString(16))
[ '640', '645', '640' ]
Finding this "m medial" letter on this page gives me this:
ﻤ
fee4
The code points 640 and 645 are the "Arabic tatweel" (ـ) and the "Arabic letter meem" (م). What the heck? How does this work? I don't see anywhere in the information so far on Unicode Arabic how these glyphs are "composed". Why is it composed from these parts? Is there a pattern for the structure of all glyphs? (All the glyphs on the first Wikipedia page are built like this, but on the second page each one is a single code point.) Where do I find information on how to parse out the characters effectively in Arabic (or any other language, for that matter)?
Arabic is a script with cursive joining; the shape of the letters changes depending on whether they occur initially, medially, or finally within a word. Sometimes you may want to display these contextual forms in isolation, for example to simply show what they look like.
The recommended way to go about this is by using special join-causing characters for the letters to connect to. One of these is the tatweel (also called kashida), which is essentially a short line segment with “glue” at each end. So if you surround the letter م with a tatweel character on both sides, the text renderer automatically selects its medial form as if it occurred in the middle of a word (ـمـ). The underlying character code of the م doesn’t change, only its visible glyph.
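As a minimal sketch of the above (the helper names `contextualForms` and `codePoints` are made up for illustration), you can build each contextual form of a letter by placing a tatweel on the side where it should join. Which glyph actually appears depends on the font and text renderer, but the letter's code point stays the same in every case:

```javascript
// U+0640 ARABIC TATWEEL and U+0645 ARABIC LETTER MEEM
const TATWEEL = "\u0640";
const MEEM = "\u0645";

// Hypothetical helper: wrap a letter in tatweels to request each contextual form.
// The renderer picks the glyph; the strings only differ in where the tatweels sit.
function contextualForms(letter) {
  return {
    isolated: letter,
    initial: letter + TATWEEL,           // joins to the following character
    medial: TATWEEL + letter + TATWEEL,  // joins on both sides
    final: TATWEEL + letter,             // joins to the preceding character
  };
}

// Hypothetical helper: list the code points of a string as hex.
const codePoints = (s) => [...s].map((c) => c.codePointAt(0).toString(16));

const forms = contextualForms(MEEM);
console.log(codePoints(forms.medial)); // [ '640', '645', '640' ]
console.log(codePoints(forms.final));  // [ '640', '645' ]
```

This is the same `640 645 640` sequence as in the question: one ordinary meem with a tatweel on each side.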
However, for historical reasons Unicode also contains a large set of so-called presentation forms for Arabic. These represent those same contextual letter shapes, but as separate character codes that do not change depending on their surroundings; putting the “isolated” presentation form of م between two tatweels does not affect its appearance, for instance: ـﻡـ
It is not recommended to use these presentation forms for actually writing Arabic. They exist solely for compatibility with legacy encodings and aren’t needed for correctly typesetting Arabic text. Wikipedia presumably used them just for demonstration purposes and to show that they exist. If you encounter presentation forms, you can usually apply Unicode compatibility normalisation (NFKD or NFKC) to the string to get the underlying base letters. See the Unicode FAQ on presentation forms for more information.
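Here is a rough sketch of that normalisation step in JavaScript, using the built-in `String.prototype.normalize`. The names `hasPresentationForms` and `toBaseLetters` are just illustrative, and the regular expression assumes that checking the two Arabic Presentation Forms blocks (U+FB50–U+FDFF and U+FE70–U+FEFC) is enough for your input. Keep in mind that NFKC also rewrites other compatibility characters (full-width letters, Latin ligatures and so on), so it may change more than the Arabic in a mixed string:

```javascript
// Arabic Presentation Forms-A (U+FB50..U+FDFF) and Forms-B (U+FE70..U+FEFC).
// U+FEFF is left out on purpose: it is the byte order mark, not a letter.
const PRESENTATION_FORMS = /[\uFB50-\uFDFF\uFE70-\uFEFC]/;

// Hypothetical helper: does the string contain any presentation-form code points?
function hasPresentationForms(text) {
  return PRESENTATION_FORMS.test(text);
}

// Hypothetical helper: map presentation forms back to their base letters.
// NFKC applies every compatibility mapping, not only the Arabic ones.
function toBaseLetters(text) {
  return text.normalize("NFKC");
}

const medialMeem = "\uFEE4"; // the "m medial" form from the question
console.log(hasPresentationForms(medialMeem));                      // true
console.log(toBaseLetters(medialMeem).codePointAt(0).toString(16)); // 645

// A lam-alef ligature form expands to its two base letters.
const lamAlef = "\uFEFB";
console.log([...toBaseLetters(lamAlef)].map((c) => c.codePointAt(0).toString(16)));
// [ '644', '627' ]
```

After normalising, the text renderer chooses the contextual shapes again from the surrounding letters, as it does for normally typed Arabic.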