Here is a sample dataset:
ID | Details |
---|---|
1 | Here Are the Details on Facebook's Global Part... |
2 | Aktien New York Schluss: Moderate Verluste nac... |
3 | Clôture de Wall Street : Trump plombe la tend... |
4 | '' |
5 | NaN |
I need to add a 'Language' column that gives the language used in the 'Details' column, so that the result looks like this:
ID | Details | Language |
---|---|---|
1 | Here Are the Details on Facebook's Global Part... | en |
2 | Aktien New York Schluss: Moderate Verluste nac... | de |
3 | Clôture de Wall Street : Trump plombe la tend... | fr |
4 | '' | NaN |
5 | NaN | NaN |
I tried this code:
!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)
It failed; I guess that is because of rows like the one with ID 4. So I tried this:
!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)
However, I still got an error:
LangDetectException: No features in text.
You can catch the error and return NaN from the function you apply. Note that you can pass any callable that takes one input and returns one output as the argument to .apply(); it doesn't have to be a lambda.
import numpy as np
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_lang(x):
    if len(x) <= 1:
        return np.nan
    try:
        lang = detect(x)
        if lang:
            return lang  # Return lang if it is not empty
    except LangDetectException:
        pass  # Ignore the error and fall through to the final return
    return np.nan  # Reached if lang was empty or detection raised an error

df2['Language'] = df2['Details'].apply(detect_lang)
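If you need the same "return NaN on failure" behavior for more than one function, the pattern in detect_lang can be pulled out into a small wrapper. The following is only a sketch: nan_on_error is a hypothetical helper name, and fake_detect is a stand-in for langdetect's detect so that the example runs without the library installed.

```python
import math

def nan_on_error(func, exceptions=(Exception,)):
    """Wrap func so that the listed exceptions produce NaN instead of propagating."""
    def wrapper(value):
        try:
            result = func(value)
        except exceptions:
            return math.nan
        # Treat an empty result the same way as a failure.
        return result if result else math.nan
    return wrapper

# Hypothetical stand-in for langdetect.detect (an assumption for this sketch):
# it raises when the text contains no letters and returns a dummy code otherwise.
def fake_detect(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "xx"

safe_detect = nan_on_error(fake_detect, exceptions=(ValueError,))
results = [safe_detect(s) for s in ["Some text", "", "12345"]]
```

With the real library, the equivalent call would presumably be df2['Details'].apply(nan_on_error(detect, (LangDetectException,))).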
I'm not sure why you had if len(x)>1 in there: it only returns NaN when the original string has zero or one characters. I included it in detect_lang anyway to keep the behavior consistent with your lambda.
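A likely reason the length check did not help is that a string can be longer than one character and still contain nothing to detect a language from, for example digits, whitespace, or a cell that literally holds two quote characters. As a rough sketch (has_letters is a hypothetical helper, and the sample values are made up for illustration), you could test for at least one alphabetic character instead of testing the length:

```python
def has_letters(text):
    """Return True if text is a string with at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)

# Illustrative values only: real text, a literal pair of quotes, digits,
# whitespace, an empty string, and a missing value.
samples = ["Here Are the Details", "''", "12345", "   ", "", None]
flags = [has_letters(s) for s in samples]
```

This is only a pre-filter. Keeping the try/except around detect is still the safer option, since it covers any input the check does not anticipate.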