
How to detect the language used in a column and put it in a new column?


I have the following df:

import pandas as pd

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})

It displays as follows:

    user    comment
0   Id159   I inboxed you
1   Id758   123
2   Id146   123
3   Id477   je suis fatigué
4   Id212   j'aime
5   Id999   ما نوع الجهاز بالضبط

My goal is to add a new column containing the language used in df['comment'], as follows:

    user    comment         language
0   Id159   I inboxed you   en
1   Id758   123             UNKNOWN
2   Id146   123             UNKNOWN
3   Id477   je suis fatigué fr
4   Id212   j'aime          fr
5   Id999   ما نوع الجهاز بالضبط  ar

My code:

from langdetect import detect

df['language'] = [detect(x) for x in df['comment']]

When I tried to use detect, I got the following error message:

LangDetectException: No features in text.

I tried adding an if/else statement but couldn't solve the issue.

Any help from your side will be highly appreciated (I upvote all answers)

Thank you!


Solution

  • It would help if you clarified every case that should be labelled UNKNOWN.

    For now, I assume you want non-string values and purely numeric strings to be labelled UNKNOWN.

    In that case:

    df["language"] = [
        detect(x) if isinstance(x, str) and not x.isnumeric() else "UNKNOWN"
        for x in df["comment"]
    ]
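
One caveat with the check above: `str.isnumeric()` only catches strings made entirely of numeric characters, so values such as "12.5", "!!!" or an empty string would still reach `detect` and would likely raise the same "No features in text" error. A minimal sketch of a stricter pre-check, assuming that "has at least one alphabetic character" is a reasonable proxy for "has features" (the helper name `has_letters` is hypothetical, not part of langdetect):

```python
def has_letters(value):
    """Return True only for strings that contain at least one alphabetic character."""
    # Non-strings (ints, floats, None, NaN) are rejected outright.
    if not isinstance(value, str):
        return False
    # str.isalpha() is Unicode-aware, so French accents and Arabic letters count.
    return any(ch.isalpha() for ch in value)


comments = ["I inboxed you", "123", 123, "je suis fatigué", "12.5", ""]
print([has_letters(c) for c in comments])
```

It could then replace the condition in the list comprehension, e.g. `detect(x) if has_letters(x) else "UNKNOWN"`.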
    

    EDIT:

    Or, for a more general approach (though not really recommended), you can just use exception handling:

    def f(x):
        try:
            return detect(x)
        except Exception:
            # Covers LangDetectException (no features) and errors from non-string input
            return "UNKNOWN"

    df["language"] = [f(x) for x in df["comment"]]
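
The exception-handling idea can also be written so that the detector is passed in as an argument, which makes the fallback logic easy to try out without langdetect installed. This is only a sketch: `label_languages` and `fake_detector` are hypothetical names introduced for illustration, and the stand-in detector merely imitates the "No features in text" failure.

```python
from typing import Any, Callable, Iterable, List


def label_languages(
    values: Iterable[Any],
    detector: Callable[[str], str],
    default: str = "UNKNOWN",
) -> List[str]:
    """Apply `detector` to each value, falling back to `default` on any failure."""
    labels = []
    for value in values:
        # Non-string values (e.g. the integer 123) are never sent to the detector.
        if not isinstance(value, str):
            labels.append(default)
            continue
        try:
            labels.append(detector(value))
        except Exception:
            # Any detector error (such as "No features in text") maps to the default.
            labels.append(default)
    return labels


def fake_detector(text: str) -> str:
    """Stand-in for a real detector: fails on text that contains no letters."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "xx"


print(label_languages(["I inboxed you", "123", 123], fake_detector))
```

With langdetect available, the same function would be called as `df["language"] = label_languages(df["comment"], detect)`.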