python-3.x pandas dataframe twitter

How to check which row is producing the LangDetectException error in LangDetect?


I have a dataset of tweets that is mostly in English but also contains several tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English tweets and remove the rows with tweets in other languages. I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I ran it on my own dataset, it raised an error:

LangDetectException: No features in text.

I have also checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandasand], where the accepted answer discusses this error and mentions that empty rows might be the cause, so I have already cleaned my dataset to remove all empty rows.

Simple code that worked on the sample data but not on the original data:

from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new) 

How can I check which row is producing the error?

Thanks in Advance!


Solution

  • Use a custom function that returns True if detect fails:

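    from langdetect import detect
    import pandas as pd
    
    df = pd.read_csv('Sample.csv')
    
    def f(x):
        # Return True when detect raises for this value, False otherwise
        try:
            detect(x)
            return False
        except Exception:
            return True
    
    # Texts of the rows on which detect fails; the index identifies the rows
    s = df.loc[df.text.apply(f), 'text']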
    

    Another idea is to create a new column filled with the output of detect, returning NaN where it fails. Then filter the rows with missing values into df1 (the rows that produce the error) and the rows detected as English into df_new:

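    from langdetect import detect
    import numpy as np
    import pandas as pd
    
    df = pd.read_csv('Sample.csv')
    
    def f1(x):
        # Return the detected language code, or NaN when detect raises
        try:
            return detect(x)
        except Exception:
            return np.nan
    
    df['new'] = df.text.apply(f1)
    
    # Rows on which detect failed
    df1 = df[df.new.isna()]
    
    # Rows detected as English
    df_new = df[df.new.eq('en')]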
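
If it also helps to see why each row fails, the same try/except idea can record the exception type and message next to the row index. The following is only a minimal sketch: find_failing_rows and fake_detect are hypothetical names introduced here, and fake_detect is a stand-in that merely imitates a detector failing on text without letters, so the example runs without langdetect installed. It does not reproduce langdetect's real logic.

```python
def find_failing_rows(rows, detect_fn):
    """Return (index, text, error) for every row on which detect_fn raises.

    rows is any iterable of (index, text) pairs; detect_fn is the detector.
    """
    failures = []
    for idx, text in rows:
        try:
            detect_fn(text)
        except Exception as exc:  # broad on purpose: report every kind of failure
            failures.append((idx, text, f"{type(exc).__name__}: {exc}"))
    return failures


def fake_detect(text):
    """Hypothetical stand-in for a language detector (assumption, not langdetect)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"


# Small hand-made sample: one row without letters, one non-string value
sample = [(0, "Hello world"), (1, "12345 !!!"), (2, "Good morning"), (3, None)]
failures = find_failing_rows(sample, fake_detect)

for idx, text, err in failures:
    print(idx, repr(text), err)
```

With the DataFrame from the question, the call would presumably look like find_failing_rows(df['text'].items(), detect), since Series.items() yields (index, value) pairs. Keeping the error text makes it easier to distinguish rows that contain no usable text from rows that are not strings at all.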