I'm trying to go through a list of (mostly) Arabic sentences, and remove those that are not Arabic. I've got a hack for telling if a character is Arabic or not: Arabic has no case, so if the character is alpha but isn't upper case or lower case, it's Arabic.
I've got the code below, which works, but the language-identification part is very slow compared to the other filter. It doesn't seem to be doing anything particularly complex, so I don't understand why it takes so long. (The corpus is about 300K sentences before filtering.)
Is there something I can do to make it more efficient?
Thanks!
def test_lang(string):
    """Takes a string and determines if it is written in Arabic
    characters or foreign, by testing whether the first character
    has a case attribute. This is intended for Arabic texts that
    may have English or French words added. If it encounters another
    case-less language (Chinese for instance), it will falsely
    identify it as Arabic."""
    if not string or not string.isalpha():
        return None
    char = string[0]
    if char.isalpha() and not (char.islower() or char.isupper()):
        lang = 'AR'
    else:
        lang = 'FW'
    return lang
...
# remove sentences that are in English or French - THIS IS SLOW (takes a few mins)
for sent in sents:
    if sent and test_lang(sent[0]) != 'AR':
        sents.remove(sent)
# remove clearly MSA sentences -- THIS IS FAST (takes a few seconds)
msa_features = ['ليس','لست','ليست','ليسوا','الذي','الذين','التي','ماذا','عن']
p = re.compile('|'.join(msa_features))
for sent in sents:
    if re.search(p, sent):
        sents.remove(sent)
list.remove is extremely slow for this purpose: it searches the entire list from the front for the given value on every call, and then deletes it. The loop therefore effectively traverses the whole list once per removed element, causing quadratic runtime. (Removing items from a list while iterating over it is also a bug in its own right: after each removal the remaining items shift left, so the loop skips the element immediately following each one it removes.)
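A tiny illustration of the skipping problem with made-up data: because removal shifts the list under the running loop index, adjacent matches survive the filter.

```python
# Removing while iterating skips the element after each removal.
sents = ['en one', 'en two', 'ar one', 'en three', 'en four']
for sent in sents:
    if sent.startswith('en'):
        sents.remove(sent)
print(sents)  # ['en two', 'ar one', 'en four'] -- two 'en' items slipped through
```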
A better solution here would be the following list comprehension:
sents = [
    sent for sent in sents
    if sent and test_lang(sent[0]) == 'AR' and not re.search(p, sent)
]
This filters the list in linear time.
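A quick timing sketch on synthetic data (the size and predicate are arbitrary assumptions, chosen only to make the asymptotic difference visible) contrasts the remove-in-a-loop pattern with a comprehension:

```python
import time

n = 10_000
data = [str(i) for i in range(n)]

# Quadratic: list.remove rescans the list from the front on every call.
start = time.perf_counter()
quad = list(data)
for item in list(quad):          # iterate over a copy to avoid the skipping bug
    if int(item) % 2 == 0:
        quad.remove(item)
t_quad = time.perf_counter() - start

# Linear: build a new list in a single pass with a comprehension.
start = time.perf_counter()
lin = [item for item in data if int(item) % 2 != 0]
t_lin = time.perf_counter() - start

print(f"remove loop: {t_quad:.3f}s, comprehension: {t_lin:.3f}s")
```

Both approaches produce the same filtered list; only the running time differs, and the gap widens rapidly as the input grows.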
(I would guess that the first filter operates on a very long list and discards most of it, while the second one receives a much smaller list and doesn't have to remove very much; that would explain why the first is so much slower.)