Tags: python, dataframe, nlp, text-mining, language-detection

How to extract only those rows of a DataFrame where the values of two columns are in English?


I have a dataframe with 27 columns, including the columns FonctionsStagiaire and ExigencesParticulieres. The dataframe has 13774 rows, each of which is either entirely in English or entirely in French. The csv file can be found here: GDrive link

I am trying to keep only those rows of the dataframe where the contents of FonctionsStagiaire or ExigencesParticulieres are in English, and to drop the entire row where these columns contain French text.

I am using langdetect but am getting the error langdetect.lang_detect_exception.LangDetectException: No features in text. I have checked all the solutions on SO for this error, but none of them works.

I also want to check whether these 2 columns contain only NaN values, only numeric or whitespace characters, only punctuation marks, etc., since such values can't be classified as English or French and might cause errors when using langdetect.
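Values like these can be screened out before any language detection runs, with a mask that requires a minimum number of alphabetic characters in both columns. A minimal sketch with toy data (the column names match the question, but the three-letter threshold is an assumption, not something from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    "FonctionsStagiaire": ["Analyze data daily", "12345", "   ", "!!!", None],
    "ExigencesParticulieres": ["Strong writing skills", "678", "\t", "...", None],
})

cols = ["FonctionsStagiaire", "ExigencesParticulieres"]

# Keep only rows where both columns contain at least 3 letters, so that
# NaN, numeric-only, whitespace-only and punctuation-only values are dropped
# before they ever reach a language detector.
has_text = (
    df[cols]
    .fillna("")
    .apply(lambda s: s.str.count(r"[A-Za-z]") >= 3)
    .all(axis=1)
)
clean = df[has_text]
print(clean)
```

Only the first row survives here; the numeric, whitespace, punctuation and NaN rows are all removed by the same mask.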

When I try to find which row is throwing the error, following this SO question, it reports This row throws error for every row.

I intend to use langdetect but any other solution would also be helpful as long as it works properly.

Any help is much appreciated.

I am trying with this code:

from langdetect import detect
import pandas as pd
import re
import string

def filter_nonenglish(df):

    cols = ['FonctionsStagiaire', 'ExigencesParticulieres']   # renamed so it doesn't shadow the built-in list

    new_df = df
    for col in cols:
        # to check if values are only whitespace, only numeric characters,
        # or only punctuation marks
        only_junk = new_df[col].apply(
            lambda x: x.isnumeric()
            or x.isspace()
            or all(c in string.punctuation for c in x)
        )
        if only_junk.any():
            return False
        # use apply() to run detect() on each <str> value of the column,
        # not on the entire pd.Series object
        new_df = new_df[new_df[col].apply(detect).eq('en')]

    return new_df

    # other attempts that also failed:
    #df['FonctionsStagiaire'] = df['FonctionsStagiaire'].apply(detect)
    #df['ExigencesParticulieres'] = df['ExigencesParticulieres'].apply(detect)

    #df = df[df['FonctionsStagiaire'] == 'en']
    #df = df[df['ExigencesParticulieres'] == 'en']

    #new_df = df[(df.FonctionsStagiaire.apply(detect).eq('en')) & (df.ExigencesParticulieres.apply(detect).eq('en'))]
    
df = pd.read_csv('emplois_df_parsed.csv')

df = df[df['FonctionsStagiaire'].notna() & df['ExigencesParticulieres'].notna()]   #to remove empty values

#to check whether all empty values are removed or not in the column 'FonctionsStagiaire'
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].str.lower()  
#df = df[df['FonctionsStagiaire'].str.islower()]

#to check whether all empty values are removed or not in the column 'ExigencesParticulieres'
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].str.lower()
#df = df[df['ExigencesParticulieres'].str.islower()]

#to make sure that values of both the columns are of <str> datatype
df['FonctionsStagiaire'] = df['FonctionsStagiaire'].astype("string")
df['ExigencesParticulieres'] = df['ExigencesParticulieres'].astype("string")

#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'FonctionsStagiaire']))
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'ExigencesParticulieres']))

#   df[df[column].map(lambda x: x.isascii())]

df_new = filter_nonenglish(df)

df_new.to_csv('emplois_df_filtered.csv', index= False)


Solution

  • I'm not sure, but it seems like langdetect can't handle URLs:

    tmp = df.loc[[24], ["FonctionsStagiaire", "ExigencesParticulieres"]].T

                                                                           24
    FonctionsStagiaire      https://www2.csrdn.qc.ca//files/jobs/P-22-875-...
    ExigencesParticulieres  https://www2.csrdn.qc.ca//files/jobs/P-22-875-...
    

    Using tmp[24].apply(detect) (row 25) throws a LangDetectException: No features in text.
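    To pinpoint rows like that without crashing the whole pipeline, the detector can be wrapped in a try/except that returns a sentinel instead of raising. A sketch, where `toy_detect` is a hypothetical stand-in for `langdetect.detect` (which fails on text without alphabetic features):

```python
def safe_detect(text, detector):
    """Return the detected language, or None when the detector fails."""
    try:
        return detector(text)
    except Exception:
        return None

def toy_detect(text):
    # hypothetical stand-in detector used only for illustration;
    # in practice you would pass langdetect.detect instead
    if not any(c.isalpha() for c in text):
        raise ValueError("No features in text.")
    return "en"

rows = ["Analyze data", "12345", "   "]
results = [safe_detect(r, toy_detect) for r in rows]
print(results)
```

    Rows that map to None are exactly the ones that would have raised, so `df[df[col].apply(lambda t: safe_detect(t, detect)) == "en"]` filters without ever hitting the exception.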

    An alternative would be to use langid :

    #pip install langid
    from langid import classify
    
    use_cols = ["FonctionsStagiaire", "ExigencesParticulieres"]
    
    # checking english content
    is_en = [
        classify(r)[0] == "en"
        for r in df[use_cols].fillna("").agg(" ".join, axis=1)
    ]
    
    # not a null content
    not_na = df[use_cols].notna().all(axis=1)
    
    # not a random content (optional!)
    not_arc = df[use_cols].apply(lambda x: x.str.fullmatch(r"\w+\s?\d?")).any(axis=1)
    
    out = df.loc[is_en & not_na & ~not_arc]
    

    Output :

    print(out.loc[:, use_cols])
    
                                FonctionsStagiaire                   ExigencesParticulieres
    11     As a Level I Technical Customer Serv...  Qualifications           Your contri...
    106    What you'll create and do    Today's...  What you'll bring to this role:    A...
    140    The Training Department plays a cruc...  REQUIREMENTS:            \t     Coll...
    ...                                        ...                                      ...
    13608  CCHS Facility: Cleveland Clinic Cana...  MINIMUM QUALIFICATIONS:  · Registere...
    13697                               Calculate!                          Love numbers...
    13698                               Calculate!                          Love numbers...
    
    [311 rows x 2 columns]