
Filtering list of Arabic sentences based on language test: Why so slow?


I'm trying to go through a list of (mostly) Arabic sentences, and remove those that are not Arabic. I've got a hack for telling if a character is Arabic or not: Arabic has no case, so if the character is alpha but isn't upper case or lower case, it's Arabic.

I've got the code below, which works, but the language identification part is very slow compared to the other filter. It doesn't seem to me like it's doing anything particularly complex, so I don't understand why it's taking so long. (The corpus size is about 300K sentences before filtering.)

Is there something I can do to make it more efficient?

Thanks!

def test_lang(string):
    """Takes a string and determines if it is written in Arabic 
    characters or foreign, by testing whether the first character 
    has a case attribute. This is intended for Arabic texts that  
    may have English or French words added. If it encounters another 
    case-less language (Chinese for instance), it will falsely 
    identify it as Arabic."""

    if not string or not string.isalpha():
        return None
    char = string[0]
    if char.isalpha() and not (char.islower() or char.isupper()):
        lang = 'AR'
    else:
        lang = 'FW'
    return lang

...

# remove sentences that are in English or French - THIS IS SLOW (takes a few mins)
for sent in sents:
    if sent and test_lang(sent[0]) != 'AR':
        sents.remove(sent)

# remove clearly MSA sentences -- THIS IS FAST (takes a few seconds)
msa_features = ['ليس','لست','ليست','ليسوا','الذي','الذين','التي','ماذا', 'عن']
p = re.compile('|'.join(msa_features))
for sent in sents:
    if re.search(p, sent):
        sents.remove(sent)

Solution

  • list.remove is extremely slow for this purpose: it scans the list from the beginning for the given value each time, and then has to shift every later element down by one to fill the gap. In effect it traverses the list once for each element that is removed, causing quadratic runtime.

    A better solution here would be the following list comprehension:

    sents = [
        sent for sent in sents
        if sent and test_lang(sent[0]) == 'AR' and not re.search(p, sent)
    ]
    

    This filters the list in linear time.
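    As a rough, self-contained illustration of the difference (the sentence counts and strings below are made up, not your real corpus):

    ```python
    import time

    def test_lang(string):
        """Same case-based heuristic as in the question."""
        if not string or not string.isalpha():
            return None
        char = string[0]
        if char.isalpha() and not (char.islower() or char.isupper()):
            return 'AR'
        return 'FW'

    # Synthetic corpus: half Arabic, half English.
    sents = ['جملة عربية'] * 5000 + ['an english sentence'] * 5000

    # Quadratic: every remove() rescans the list from the start.
    # (Iterating over a copy here so no elements get skipped.)
    slow = list(sents)
    t0 = time.perf_counter()
    for sent in list(slow):
        if sent and test_lang(sent[0]) != 'AR':
            slow.remove(sent)
    t_remove = time.perf_counter() - t0

    # Linear: one pass building a new list.
    t0 = time.perf_counter()
    fast = [sent for sent in sents if sent and test_lang(sent[0]) == 'AR']
    t_comp = time.perf_counter() - t0

    print(f"remove loop: {t_remove:.3f}s  comprehension: {t_comp:.4f}s")
    ```

    Even at a fraction of your corpus size, the remove loop is typically orders of magnitude slower, and the gap grows quadratically as the list gets longer.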

    (I would guess that the first filter has to operate on a very long list and discards most of it? While the second one receives a much smaller list and doesn't have to remove very much? This would explain why the first one is so much slower.)
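  • A separate correctness issue: removing from a list while iterating over it skips the element that follows each removal, so the original loops can silently leave some unwanted sentences in place. A minimal demonstration:

    ```python
    # Removing during iteration skips the element after each removal.
    nums = [1, 1, 2, 3]
    for n in nums:
        if n == 1:
            nums.remove(n)
    print(nums)  # [1, 2, 3] -- the second 1 survives
    ```

    The list comprehension avoids this entirely, because it never mutates the list it is reading from.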