
How to convert an Arabic character to its base glyph form in Python 3?


Because a single Arabic character can take on multiple glyph forms, there are multiple Unicode/UTF-8 encodings, one per form, e.g. Aleph: Isolated == ا with utf-8==\xD8\xA7, Final == ـا with utf-8==\xD9\x80\xD8\xA7, Hamza == أ / إ with utf-8==\xD8\xA3 / \xD8\xA5, Maddah == آ with utf-8==\xD8\xA2, Maqsurah == ى with utf-8==\xD9\x89, where the base form would be the isolated aleph with utf-8==\xD8\xA7.

How can I convert an Arabic character to its base glyph form in Python 3?


Solution

  • You can use unicodedata.normalize to convert code points to their decomposed form, consisting of a base character followed by a combining mark. It doesn't work for all cases (in particular Maqsurah, U+0649, which has no decomposition), but it could help you write a function to determine some base forms:

    >>> s = 'ـا'  # already two code points: a tatweel followed by the base alef
    >>> import unicodedata as ud
    >>> for c in s:
    ...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
    ...     
    ـ U+0640 ARABIC TATWEEL
    ا U+0627 ARABIC LETTER ALEF
    
    >>> s = 'أإآ' # These characters have decomposed forms
    >>> for c in s:
    ...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
    ...     
    أ U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
    إ U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW
    آ U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
    >>> s = ud.normalize('NFD',s)
    >>> for c in s:
    ...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
    ...     
    ا U+0627 ARABIC LETTER ALEF
    ٔ  U+0654 ARABIC HAMZA ABOVE
    ا U+0627 ARABIC LETTER ALEF
    ٕ  U+0655 ARABIC HAMZA BELOW
    ا U+0627 ARABIC LETTER ALEF
    ٓ  U+0653 ARABIC MADDAH ABOVE
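
Building on the decomposition shown above, here is a minimal sketch of such a function, not a definitive implementation. The names `base_form`, `TATWEEL` and `EXTRA_BASE` are hypothetical. It uses NFKD rather than NFD, on the assumption that you also want presentation forms such as U+FE8E folded to their base letter, and it maps Maqsurah to Alef by hand only because the question treats Alef as its base form. Note that dropping every nonspacing mark also removes vowel marks (harakat), which may or may not be what you want.

```python
import unicodedata as ud

TATWEEL = '\u0640'  # elongation character, carries no letter identity

# Assumption: hand-written mappings for letters that normalization leaves
# untouched. ALEF MAKSURA -> ALEF follows the question's notion of "base".
EXTRA_BASE = {'\u0649': '\u0627'}


def base_form(text):
    """Return text with Arabic letters reduced to their base code points."""
    # NFKD splits e.g. ALEF WITH HAMZA ABOVE into ALEF + HAMZA ABOVE and
    # also replaces presentation forms with the ordinary letter.
    decomposed = ud.normalize('NFKD', text)
    result = []
    for c in decomposed:
        # Skip tatweel and nonspacing marks (hamza above/below, maddah, harakat).
        if c == TATWEEL or ud.category(c) == 'Mn':
            continue
        result.append(EXTRA_BASE.get(c, c))
    return ''.join(result)


if __name__ == '__main__':
    for sample in ('\u0623\u0625\u0622', '\u0640\u0627', '\u0649', '\ufe8e'):
        for c in base_form(sample):
            print(f'U+{ord(c):04X} {ud.name(c)}')
```

If you need to keep some distinctions (for example, leaving Maqsurah alone), remove the corresponding entry from `EXTRA_BASE`.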