python pandas dataframe language-detection

How to detect the language used in a column and put it in a new column?

I have the following df:

import pandas as pd

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})

It displays as follows:

    user    comment
0   Id159   I inboxed you
1   Id758   123
2   Id146   123
3   Id477   je suis fatigué
4   Id212   j'aime
5   Id999   ما نوع الجهاز بالضبط

My goal is to add a new column containing the language used in df['comment'], as follows:

    user    comment               language
0   Id159   I inboxed you         en
1   Id758   123                   UNKNOWN
2   Id146   123                   UNKNOWN
3   Id477   je suis fatigué       fr
4   Id212   j'aime                fr
5   Id999   ما نوع الجهاز بالضبط  ar

My code

from langdetect import detect

df['language'] = [detect(x) for x in df['comment']]

When I run this, detect raises the following error:

LangDetectException: No features in text.

I tried adding an if/else statement but couldn't solve the issue.

Any help from your side will be highly appreciated (I upvote all answers)

Thank you!

Solution

It would help if you clarified exactly which cases you want labeled as UNKNOWN.

For now, I assume you want non-string values and purely numeric strings labeled as UNKNOWN.

In that case:

df["language"] = [
    detect(x) if isinstance(x, str) and not x.isnumeric() else "UNKNOWN"
    for x in df["comment"]
]
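
Note that `isnumeric()` only covers strings made up entirely of numeric characters. Inputs such as "123 456", "" or "!!!" pass that check but still contain nothing a detector can work with, so they would most likely raise the same error. A minimal sketch of a stricter pre-check follows, assuming "has at least one alphabetic character" is a reasonable proxy for detectable text. The names `has_letters`, `label_languages` and `fake_detector` are hypothetical, and the stand-in detector is only there so the sketch runs without langdetect installed.

```python
def has_letters(value):
    """Return True only for strings that contain at least one alphabetic character."""
    return isinstance(value, str) and any(ch.isalpha() for ch in value)


def label_languages(comments, detector, fallback="UNKNOWN"):
    """Call `detector` on values that pass has_letters; use `fallback` for the rest."""
    return [detector(c) if has_letters(c) else fallback for c in comments]


# Stand-in detector (hypothetical) so the sketch is runnable on its own.
# In real use, pass langdetect's `detect` here instead.
def fake_detector(text):
    return "xx"


samples = ["I inboxed you", "123", 123, "123 456", "", "!!!", "j'aime"]
print(label_languages(samples, fake_detector))
# ['xx', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'xx']
```

With the real library this would be used as `df["language"] = label_languages(df["comment"], detect)`. The pre-check reduces the number of failing inputs but is not guaranteed to catch every one of them.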

EDIT:

Or, for a more general approach (though not really recommended), you can just use exception handling:

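def f(x):
    try:
        return detect(x)
    # Catch Exception rather than using a bare except, so that
    # KeyboardInterrupt and SystemExit still propagate.
    except Exception:
        return "UNKNOWN"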

df["language"] = [f(x) for x in df["comment"]]
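
Part of the reason blanket exception handling is not recommended is that it hides unrelated bugs. One way to narrow it is to list the exception types you expect. The sketch below is a hypothetical wrapper (`make_safe_detector` is not part of langdetect), demonstrated with a stand-in detector so it runs without the library installed.

```python
def make_safe_detector(detector, errors=(Exception,), fallback="UNKNOWN"):
    """Wrap `detector` so that only the listed exception types yield `fallback`."""
    def safe(value):
        try:
            return detector(value)
        except errors:
            return fallback
    return safe


# Stand-in detector (hypothetical) that rejects non-strings and letterless text,
# roughly imitating the failure modes described in the question.
def picky_detector(text):
    if not isinstance(text, str):
        raise TypeError("expected a string")
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "xx"


safe = make_safe_detector(picky_detector, errors=(TypeError, ValueError))
print([safe(x) for x in ["hello", "123", 123]])
# ['xx', 'UNKNOWN', 'UNKNOWN']
```

With langdetect, the wrapper could be built as `make_safe_detector(detect, errors=(LangDetectException, TypeError))`, with `LangDetectException` imported from `langdetect.lang_detect_exception`. That non-string input raises `TypeError` is an assumption worth checking against your installed version. Separately, langdetect's README notes that detection is non-deterministic and suggests setting `DetectorFactory.seed = 0` before the first call if you need reproducible results, which matters for very short comments like "j'aime".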