I have the following df:
import pandas as pd

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})
It has the following display:
    user  comment
0  Id159  I inboxed you
1  Id758  123
2  Id146  123
3  Id477  je suis fatigué
4  Id212  j'aime
5  Id999  ما نوع الجهاز بالضبط
My goal is to add a new column containing the language of each entry in df['comment'], as follows:
    user  comment               language
0  Id159  I inboxed you         en
1  Id758  123                   UNKNOWN
2  Id146  123                   UNKNOWN
3  Id477  je suis fatigué       fr
4  Id212  j'aime                fr
5  Id999  ما نوع الجهاز بالضبط  ar
My code:
from langdetect import detect
df['language'] = [detect(x) for x in df['comment']]
When I tried to use detect, I got the following error:
LangDetectException: No features in text.
I tried adding an if/else statement but couldn't solve the issue.
Any help will be highly appreciated (I upvote all answers).
Thank you!
It would be better if you clarified which cases you want to set as UNKNOWN.
Anyway, I assume you want to set non-string and numeric values to UNKNOWN. Then:
df["language"] = [
    detect(x) if isinstance(x, str) and not x.isnumeric() else "UNKNOWN"
    for x in df["comment"]
]
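One caveat with the check above: str.isnumeric() is False for an empty string and for strings such as "12.5" or "?!", so those values still reach detect and, as far as I know, would raise the same "No features in text" error. Here is a minimal sketch of a stricter guard, assuming that what detect needs is at least one alphabetic character. The names has_letters and original_guard are made up for illustration and are not part of langdetect.

```python
def has_letters(value):
    """Return True only for strings containing at least one alphabetic character."""
    return isinstance(value, str) and any(ch.isalpha() for ch in value)


def original_guard(value):
    """The check from the list comprehension above, kept for comparison."""
    return isinstance(value, str) and not value.isnumeric()


samples = ["I inboxed you", "123", 123, "je suis fatigué", "", "12.5", "?!"]

# Print both checks side by side for each sample value.
for s in samples:
    print(repr(s), original_guard(s), has_letters(s))

# Assumed usage with the question's DataFrame (requires langdetect):
# df["language"] = [detect(x) if has_letters(x) else "UNKNOWN" for x in df["comment"]]
```

The two checks agree on the question's data but differ on the last three samples, which the original guard lets through.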
EDIT:
Or, for a more general approach (though not really recommended), you can just use exception handling:
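from langdetect import detect

def f(x):
    try:
        return detect(x)
    except Exception:
        # Any failure (non-string input, text with no features) becomes UNKNOWN.
        return "UNKNOWN"

df["language"] = [f(x) for x in df["comment"]]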
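To illustrate why the catch-all version is not really recommended, here is a small sketch that does not need langdetect at all. It compares a wrapper that swallows every exception with one that only swallows the exception types you expect. The names broad, narrow, and buggy_detect are hypothetical, and buggy_detect deliberately contains a typo to stand in for an unrelated bug.

```python
def broad(func, value, default="UNKNOWN"):
    """Swallow every ordinary exception, as a catch-all handler does."""
    try:
        return func(value)
    except Exception:
        return default


def narrow(func, value, expected, default="UNKNOWN"):
    """Swallow only the exception types listed in `expected`."""
    try:
        return func(value)
    except expected:
        return default


def buggy_detect(text):
    # Hypothetical detector with a typo: `langauge_of` is not defined anywhere,
    # so calling this function raises NameError.
    return langauge_of(text)


# The broad wrapper hides the bug behind the default value.
broad_result = broad(buggy_detect, "hello")

# The narrow wrapper lets the unexpected NameError propagate.
try:
    narrow(buggy_detect, "hello", expected=(ValueError, TypeError))
    bug_surfaced = False
except NameError:
    bug_surfaced = True
```

With langdetect, the expected types would presumably be LangDetectException (importable from langdetect.lang_detect_exception) plus TypeError for non-string input, but check which exceptions your installed version actually raises.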