I have a dataframe which has 27 columns including columns FonctionsStagiaire
and ExigencesParticulieres
. The dataframe has 13774 rows which are either entirely in English or French. The csv file can be found here: GDrive link
I am trying to keep only those rows of the dataframe where the values or contents of FonctionsStagiaire
or ExigencesParticulieres
are in English. I want to drop the entire rows where these columns contain values in French language.
I am using langdetect
but is getting the error langdetect.lang_detect_exception.LangDetectException: No features in text.
I checked out all the solutions on SO to resolve this error but nothing is working.
I also want to check for if these 2 columns contain only NaN
values, only numeric or whitespace characters or only punctuation marks, etc. which can't be detected whether it is in English or French and might cause error while using langdetect
.
When I am trying to see which row is throwing the error based on this SO question it is showing that This row throws error
for every row.
I intend to use langdetect
but any other solution would also be helpful as long as it works properly.
Any help is much appreciated.
I am trying with this code:
from langdetect import detect
import pandas as pd
import re
import string
def filter_nonenglish(df):
list = ['FonctionsStagiaire', 'ExigencesParticulieres']
for col in list:
#to check if values are only whitespaces or only numeric characters or only punctuation marks
#something like this -> if (df[col].apply(lambda x: x.isnumeric()) == True) | (df[col].apply(lambda x: x.isspace()) == True) | (all(i in string.punctuation for i in df[col]) == True):
return False
else:
#trying to use apply() to apply detect() to each <str> values of the columns and not to the entire pd.Series object
new_df = df[(df[col].apply(detect).eq('en'))]
return new_df
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].apply(detect)
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].apply(detect)
#df = df[df['FonctionsStagiaire'] == 'en']
#df = df[df['ExigencesParticulieres'] == 'en']
#new_df = df[(df.FonctionsStagiaire.apply(detect).eq('en')) & (df.ExigencesParticulieres.apply(detect).eq('en'))]
df = pd.read_csv('emplois_df_parsed.csv')
df = df[df['FonctionsStagiaire'].notna() & df['ExigencesParticulieres'].notna()] #to remove empty values
#to check whether all empty values are removed or not in the column 'FonctionsStagiaire'
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].str.lower()
#df = df[df['FonctionsStagiaire'].str.islower()]
#to check whether all empty values are removed or not in the column 'ExigencesParticulieres'
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].str.lower()
#df = df[df['ExigencesParticulieres'].str.islower()]
#to make sure that values of both the columns are of <str> datatype
df['FonctionsStagiaire'] = pd.Series(df['FonctionsStagiaire'], dtype = "string")
df['ExigencesParticulieres'] = pd.Series(df['ExigencesParticulieres'], dtype = "string")
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'FonctionsStagiaire']))
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'ExigencesParticulieres']))
# df[df[column].map(lambda x: x.isascii())]
df_new = filter_nonenglish(df)
df_new.to_csv('emplois_df_filtered.csv', index= False)
I'm not sure but it seems like langdetect
can't handle urls :
tmp = df.loc[[24], ["FonctionsStagiaire", "ExigencesParticulieres"]].T
24
FonctionsStagiaire https://www2.csrdn.qc.ca//files/jobs/P-22-875-...
ExigencesParticulieres https://www2.csrdn.qc.ca//files/jobs/P-22-875-...
Using tmp[24].apply(detect)
(row 25) throws a LangDetectException: No features in text.
An alternative would be to use langid
:
#pip install langid
from langid import classify
use_cols = ["FonctionsStagiaire", "ExigencesParticulieres"]
# checking english content
is_en = [
classify(r)[0] == "en"
for r in df[use_cols].fillna("").agg(" ".join, axis=1)
]
# not a null content
not_na = df[use_cols].notna().all(axis=1)
# not a random content (optional!)
not_arc = df[use_cols].apply(lambda x: x.str.fullmatch("\w+\s?\d?")).any(axis=1)
out = df.loc[is_en & not_na & ~not_arc]
Output :
print(out.loc[:, use_cols])
FonctionsStagiaire ExigencesParticulieres
11 As a Level I Technical Customer Serv... Qualifications Your contri...
106 What you'll create and do Today's... What you'll bring to this role: A...
140 The Training Department plays a cruc... REQUIREMENTS: \t Coll...
... ... ...
13608 CCHS Facility: Cleveland Clinic Cana... MINIMUM QUALIFICATIONS: · Registere...
13697 Calculate! Love numbers...
13698 Calculate! Love numbers...
[311 rows x 2 columns]