I have an Excel file with tens of thousands of English/Latin and Arabic words in two columns: the first column is named "EN" and the other is named "AR". The column I want to work on is the "AR" column.
I want to add a new column that contains 'ar' for each row containing only Arabic words, 'en' for each row containing only Latin words, and 'enar' for each row containing both Latin and Arabic words.
Note: numbers, periods '.', and commas ',' can appear in any row.
An example of my file and the result I want:
EN | AR | new column
Appel | تفاحة | ar
Appel (1990) | (1990) تفاحة | ar
R. Appel | ر. تفاحة | ar
Red, Appel | Red Appel | en
Red Appel | Red Appel | en
R. Appel | R. Appel | en
Red, Appel | تفاحة، Red | enar
Red Appel | Red تفاحة | enar
How can I do that using Python/Pandas?
Thank you guys for your help.
Here is a possible solution using the third-party regex library, which supports Unicode script properties such as \p{Latin} and \p{Arabic}.
Code
import pandas as pd
import regex  # third-party package: pip install regex

data = {'AR':[' تفاحة ','(1990) تفاحة', 'ر. تفاحة', 'Red Appel', 'Red Appel', 'R. Appel', 'تفاحة، Red', 'Red تفاحة']}
df = pd.DataFrame(data)

# \p{Arabic} and \p{Latin} match characters of one script only,
# so digits, '.' and ',' do not affect either flag
df['is_arabic'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Arabic}', t)))
df['is_latin'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Latin}', t)))

# assign 'enar', 'ar', 'en'
def myfunc(t):
    # t is one row holding the two boolean flags
    if t['is_arabic'] and t['is_latin']:
        return 'enar'
    elif t['is_arabic']:
        return 'ar'
    else:
        return 'en'

df['new_column'] = df[['is_arabic', 'is_latin']].apply(myfunc, axis=1)
Output
#print(df)
# AR is_arabic is_latin new_column
# 0 تفاحة True False ar
# 1 (1990) تفاحة True False ar
# 2 ر. تفاحة True False ar
# 3 Red Appel False True en
# 4 Red Appel False True en
# 5 R. Appel False True en
# 6 تفاحة، Red True True enar
# 7 Red تفاحة True True enar
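If installing regex is not an option, the same idea can be sketched with pandas' built-in str.contains and explicit code point ranges. This is only a minimal sketch: the two ranges below are my own approximation of "Arabic letter" and "Latin letter", not complete Unicode scripts, and label_scripts is a hypothetical helper name. Unlike the code above, rows containing neither script get an empty string instead of 'en'.

```python
import numpy as np
import pandas as pd

# Assumed, approximate ranges (not the full Unicode scripts):
# Arabic letters only, so the Arabic comma and Arabic-Indic digits are excluded
ARABIC_LETTER = r'[\u0621-\u064A\u066E-\u06D3]'
# ASCII letters plus accented Latin letters, skipping the multiplication and division signs
LATIN_LETTER = r'[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]'

def label_scripts(series):
    """Return 'ar', 'en', 'enar', or '' for each value in the series."""
    text = series.fillna('').astype(str)  # treat missing cells as empty text
    has_ar = text.str.contains(ARABIC_LETTER, regex=True).to_numpy(dtype=bool)
    has_en = text.str.contains(LATIN_LETTER, regex=True).to_numpy(dtype=bool)
    # The first matching condition wins, so the combined case is checked first
    labels = np.select(
        [has_ar & has_en, has_ar, has_en],
        ['enar', 'ar', 'en'],
        default='',
    )
    return pd.Series(labels, index=series.index)

df = pd.DataFrame({'AR': [' تفاحة ', '(1990) تفاحة', 'ر. تفاحة', 'Red Appel',
                          'Red Appel', 'R. Appel', 'تفاحة، Red', 'Red تفاحة']})
df['new_column'] = label_scripts(df['AR'])
print(df)
```

Both checks run as vectorized string operations over the whole column, which may be more convenient than a row-wise apply on tens of thousands of rows. If the words use characters outside the assumed ranges, the ranges would need to be extended.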