
How to map Arabic letters to phonemes in Python?


I want to write a simple Python script that maps each Arabic letter to a phoneme symbol. The script reads words from a text file and converts them to phonemes using a dictionary defined in my code.

Content in my .txt file:

السلام عليكم
السلام عليكم و رحمة الله
السلام عليكم و رحمة الله و بركاته
الحمد لله
كيف حالك
كيف الحال

The dictionary in my code:

ar_let_phon_maplist = {u'ﺍ':'A:', u'ﺏ':'B', u'ﺕ':'T', u'ﺙ':'TH', u'ﺝ':'J', u'ﺡ':'H', u'ﺥ':'KH', u'ﻩ':'H', u'ﻉ':'(ayn) ’', u'ﻍ':'GH', u'ﻑ':'F', u'ﻕ':'q', u'ﺹ':u'ṣ', u'ﺽ':u'ḍ', u'ﺩ':'D', u'ﺫ':'DH', u'ﻁ':u'ṭ', u'ﻙ':'K', u'ﻡ':'M', u'ﻥ':'N', u'ﻝ':'L', u'ﻱ':'Y', u'ﺱ':'S', u'ﺵ':'SH', u'ﻅ':u'ẓ', u'ﺯ':'Z', u'ﻭ':'W', u'ﺭ':'R'}

I have a nested loop where I'm reading each line, converting each character:

with codecs.open(sys.argv[1], 'r', encoding='utf-8') as file:
        lines = file.readlines()

line_counter = 0

for line in lines:
        print "Phonetics In Line " + str(line_counter)
        print line + " ",
        for word in line:
                for character in word:
                        if character == '\n':
                                print ""
                        elif character == ' ':
                                print "  "
                        else:
                                print ar_let_phon_maplist[character] + " ",
line_counter +=1

And this is the error I'm getting:

Phonetics In Line 0
السلام عليكم

Traceback (most recent call last):
  File "grapheme2phoneme.py", line 25, in <module>
    print ar_let_phon_maplist[character] + " ",
KeyError: u'\u0627'

And then I checked if the file type is UTF-8 using the Linux command:

file words.txt

The output I got:

words.txt: UTF-8 Unicode text

Is there a solution for this problem? Why doesn't the character match a key in the dictionary, given that the character I use as the key in the ar_let_phon_maplist[character] line is also Unicode? Is there something wrong with my code?


Solution

  • The first thing that catches the eye is the KeyError: your dictionary simply does not know about some of the symbols encountered in the file. In fact, it does not know about ANY of the characters in the file, not only the first one.

    What can we do about it? We could just add every symbol from the Arabic blocks of the Unicode table to the dictionary. Simple? Yes. Clear? No.

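    If you want to actually understand the reasons for this 'strange' behaviour, you need to know more about Unicode. In short, there are a lot of letters that look similar but have different code points. Moreover, the same letter can sometimes be represented in multiple forms. Here, the KeyError shows that the file contains U+0627 (ARABIC LETTER ALEF), while the dictionary keys appear to be look-alike characters from the Arabic Presentation Forms blocks, such as U+FE8D (ARABIC LETTER ALEF ISOLATED FORM). So comparing Unicode characters is not a trivial task.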

    So, if I were allowed to use Python 3.3+, I would solve the task as follows. First, I would normalize the keys in the ar_let_phon_maplist dictionary:

    import unicodedata

    ar_let_phon_maplist = {unicodedata.normalize('NFKD', k): v
                                for k, v in ar_let_phon_maplist.items()}
    

    Then we iterate over the lines in the file, the words in each line, and the characters in each word, like this:

    for index, line in enumerate(lines):
        print('Phonetics in line {0}, total {1} symbols'.format(index, len(line)))
        unknown = []  # Here will be stored symbols that we haven't found in dict
        words = line.split()
        for word in words:
            print(word, ': ', sep='', end='')
            for character in word:
                c = unicodedata.normalize('NFKD', character).casefold()
                try:                
                    print(ar_let_phon_maplist[c], sep='', end='')
                except KeyError:
                    print('_', sep='', end='')
                    if c not in unknown:
                        unknown.append(c)
            print()
        if unknown:
            print('Unrecognized symbols: {0}, total {1} symbols'.format(', '.join(unknown), 
                                                                        len(unknown)))
    

    The script will produce something like this:

    Phonetics in line 4, total 9 symbols
    كيف: KYF
    حالك: HA:LK
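
To see for yourself why the original lookup fails, a minimal sketch along the following lines may help. It assumes the dictionary keys were typed as Arabic presentation forms (U+FE8D and its neighbours), which the KeyError for u'\u0627' suggests but the question does not confirm. The small presentation_map excerpt and the transliterate helper are illustrative names, not part of the original code.

```python
import unicodedata

# Hypothetical excerpt of the question's dictionary, written with explicit
# escapes so the presentation-form code points are visible (an assumption).
presentation_map = {
    '\ufe8d': 'A:',  # ARABIC LETTER ALEF ISOLATED FORM
    '\ufea1': 'H',   # ARABIC LETTER HAH ISOLATED FORM
    '\ufedd': 'L',   # ARABIC LETTER LAM ISOLATED FORM
    '\ufed9': 'K',   # ARABIC LETTER KAF ISOLATED FORM
    '\ufef1': 'Y',   # ARABIC LETTER YEH ISOLATED FORM
    '\ufed1': 'F',   # ARABIC LETTER FEH ISOLATED FORM
}

# The dictionary key and the character read from the file look alike
# but have different names and code points.
key_name = unicodedata.name('\ufe8d')
file_name = unicodedata.name('\u0627')

# NFKD maps each presentation form to its base letter, so the normalized
# keys match the characters that normally appear in UTF-8 text files.
normalized_map = {unicodedata.normalize('NFKD', k): v
                  for k, v in presentation_map.items()}


def transliterate(word, table, placeholder='_'):
    """Map each character of word through table; unknown characters
    (including any that decompose into several code points) become
    the placeholder."""
    return ''.join(table.get(unicodedata.normalize('NFKD', ch), placeholder)
                   for ch in word)


if __name__ == '__main__':
    print(key_name)
    print(file_name)
    # The word below is U+062D U+0627 U+0644 U+0643, as in the sample file.
    print(transliterate('\u062d\u0627\u0644\u0643', normalized_map))
```

Printing unicodedata.name() for a failing character and for the corresponding dictionary key is usually the quickest way to tell whether two look-alike letters are really the same code point. Using dict.get with a placeholder instead of try/except is only a stylistic choice here; the loop in the answer above behaves the same way.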