python-3.x, pandas, utf-8, xlsx

Add a tag to a new column after reading the column content


I have an Excel file with tens of thousands of English/Latin and Arabic words in two columns. The first column is named "EN" and the other is named "AR". The column I want to work on is the "AR" column.

I want to add a new column that contains 'ar' for each row whose "AR" cell has only Arabic words, 'en' for each row whose "AR" cell has only Latin words, and 'enar' for each row whose "AR" cell has both Latin and Arabic words.

Note: numbers, periods '.', and commas ',' can appear in any row, so they should not affect the tag.

An example of my file and the result I want:

    EN                       AR                new column
    Appel                        تفاحة               ar
    Appel (1990)             (1990) تفاحة            ar
    R. Appel                 ر. تفاحة                ar
    Red, Appel               Red Appel                en
    Red Appel                Red Appel                en
    R. Appel                 R. Appel                 en
    Red, Appel               تفاحة، Red              enar
    Red Appel                Red تفاحة               enar

How can I do that using Python/Pandas?

Thank you guys for your help.


Solution

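  • Here is a possible solution using the third-party regex library (pip install regex). Unlike the standard re module, it supports Unicode script properties such as \p{Latin}.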

    Code

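    import pandas as pd
    import regex  # third-party package: pip install regex
    
    data = {'AR':['    تفاحة ','(1990) تفاحة', 'ر. تفاحة', 'Red Appel', 'Red Appel', 'R. Appel', 'تفاحة، Red', 'Red تفاحة']}
    
    df = pd.DataFrame(data)
    
    # \p{Arabic} matches Arabic-script characters only. A pattern such as
    # [^\p{Latin}\W] would also match digits, so 'Red Appel (1990)' would be tagged 'enar'.
    df['is_arabic'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Arabic}', t)))
    
    # \p{Latin} already covers a-z and A-Z, as well as accented Latin letters
    df['is_latin'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Latin}', t)))
    
    # assign 'enar', 'ar', 'en' from the two boolean flags
    def myfunc(t):
        if t['is_arabic'] and t['is_latin']:
            return 'enar'
        elif t['is_arabic']:
            return 'ar'
        else:
            # note: rows with no letters at all also end up here
            return 'en'
    
    df['new_column'] = df[['is_arabic','is_latin']].apply(myfunc, axis=1)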
    
    

    Output

    #print(df)
    #              AR  is_arabic  is_latin new_column
    # 0        تفاحة        True     False         ar
    # 1  (1990) تفاحة       True     False         ar
    # 2      ر. تفاحة       True     False         ar
    # 3     Red Appel      False      True         en
    # 4     Red Appel      False      True         en
    # 5      R. Appel      False      True         en
    # 6    تفاحة، Red       True      True       enar
    # 7     Red تفاحة       True      True       enar
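
If installing a third-party package is not an option, the same rule can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the function name script_tag is hypothetical, it decides the script of each letter from its Unicode character name, and it assumes that a row with no letters at all should get an empty tag.

```python
import unicodedata


def script_tag(text):
    """Return 'ar', 'en', 'enar', or '' depending on the letters in text."""
    has_arabic = False
    has_latin = False
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            has_arabic = True
        elif name.startswith("LATIN"):
            has_latin = True
    if has_arabic and has_latin:
        return "enar"
    if has_arabic:
        return "ar"
    if has_latin:
        return "en"
    return ""  # assumption: rows without any letters get an empty tag


if __name__ == "__main__":
    for sample in ["(1990) تفاحة", "R. Appel", "تفاحة، Red", "1990, 2.5"]:
        print(repr(sample), "->", repr(script_tag(sample)))
```

With pandas, this could be applied as df['new_column'] = df['AR'].astype(str).map(script_tag), where df comes from pd.read_excel on the real workbook and is written back with df.to_excel. It will likely be slower than a compiled regular expression on very large files, but it has no extra dependencies.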