I have a dataset of tweets that are mainly in English but also include several tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English-language tweets and remove the rows containing tweets in other languages. I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I tried it on my dataset, it raised an error:
LangDetectException: No features in text.
I have also checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandasand], where the accepted answer discusses this error and mentions that empty rows might be the cause, so I have already cleaned my dataset to remove all empty rows.
Simple code that worked on the sample data but not on the original data:
from langdetect import detect
import pandas as pd
df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new)
How can I check which row is producing the error?
Thanks in Advance!
Use a custom function that returns True if detect fails:
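from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')

def f(x):
    """Return True if detect fails on x, otherwise False."""
    try:
        detect(x)
        return False
    except Exception:
        return True

# Rows whose text makes detect raise an error
s = df.loc[df.text.apply(f), 'text']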
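Once the failing rows are isolated, it may help to see why they fail. As far as I know, "No features in text" is raised when the text contains nothing the detector can use, for example only digits, punctuation, or whitespace, so rows can be non-empty and still fail. The sketch below is an assumption about the likely cause, not a description of langdetect internals; has_letters is a hypothetical helper name, and the sample strings are made up for illustration.

```python
def has_letters(text):
    """Return True if text is a string with at least one alphabetic character."""
    # Non-string values (e.g. NaN from missing cells) count as having no letters.
    return isinstance(text, str) and any(ch.isalpha() for ch in text)

# Made-up sample values: only the first and last contain letters.
samples = ["Hello world", "12345", "!!! ???", "   ", float("nan"), "ਸਤ ਸ੍ਰੀ ਅਕਾਲ"]

# Values that are likely to give the detector nothing to work with.
suspects = [s for s in samples if not has_letters(s)]
print(len(suspects), "suspect values")
```

On a DataFrame, the same check could be applied with `df[~df.text.apply(has_letters)]` to list the suspect rows before calling detect at all.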
Another idea is to create a new column filled with the output of detect, returning NaN if it fails. Then filter the rows with missing values into df1, and the rows detected as English into df_new:
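import numpy as np
import pandas as pd
from langdetect import detect

df = pd.read_csv('Sample.csv')

def f1(x):
    """Return the detected language code, or NaN if detect fails."""
    try:
        return detect(x)
    except Exception:
        return np.nan

df['new'] = df.text.apply(f1)
df1 = df[df.new.isna()]        # rows where detection failed
df_new = df[df.new.eq('en')]   # rows detected as English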
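A variation on the same idea, as a minimal sketch: keep the error message instead of discarding it, so each failing row shows what went wrong. The names detect_with_reason and stub_detector are hypothetical; stub_detector is only a stand-in so the sketch runs without langdetect installed, and in practice langdetect's detect would be passed as the detector argument.

```python
def detect_with_reason(text, detector):
    """Return (language, error_message); exactly one of the two is None."""
    try:
        return detector(text), None
    except Exception as exc:
        # Keep the exception type and message so failing rows can be inspected.
        return None, f"{type(exc).__name__}: {exc}"

def stub_detector(text):
    # Hypothetical stand-in for langdetect.detect, NOT the real library logic.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en" if text.isascii() else "other"

texts = ["good morning", "2021", "नमस्ते"]  # made-up sample values
results = [detect_with_reason(t, stub_detector) for t in texts]

# Positions of the texts for which detection raised an error.
failed_rows = [i for i, (lang, err) in enumerate(results) if err is not None]
print(failed_rows)
```

With pandas, something like `df.text.apply(lambda t: detect_with_reason(t, detect))` would give one (language, error) pair per row, which can then be split into two columns.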