python pandas spacy named-entity-recognition

How to extract Named Entities from Pandas DataFrame using SpaCy


I am trying to extract named entities using the first answer to this question. My code is as follows:

for i in df['Article'].to_list():
    doc = nlp(i)
    for entity in doc.ents:
        print(entity.text)

But it does not print any entities. I have tried print(i) and print(doc); both variables have values, and df['Article'] contains news text. Can someone explain why the second loop is not extracting entities? Thank you.

EDIT:
This is the dataset file. Please run the following code to reproduce the preprocessing I have done.

df.iloc[:,0].dropna(inplace=True)
df = df[df.iloc[:,0].notna()]

To remove special characters from df['Article']:

df['Article'] = df['Article'].map(lambda x: re.sub(r'\W+', '', x))

Solution

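  • df['Article'].map(lambda x: re.sub(r'\W+', '', x)) removes every non-word character, and that includes all whitespace. Each article collapses into one long string with no word boundaries, so spaCy has nothing it can recognize as an entity.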

    You need to use

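    df['Article'] = df['Article'].str.replace(r'(?:_|[^\w\s])+', '', regex=True)

    That regex removes only underscores and characters that are neither word characters nor whitespace, so the spaces between words survive. Pass regex=True explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default.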
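To see the difference between the two patterns without loading the dataset, here is a minimal sketch using only the standard re module. The sample sentence is made up for illustration and stands in for one row of df['Article'].

```python
import re

# Hypothetical sample text standing in for one row of df['Article'].
sample = "Barack Obama visited Paris, France!"

# Pattern from the question: strips every non-word character, spaces included.
glued = re.sub(r'\W+', '', sample)

# Pattern from the answer: strips underscores and punctuation, keeps whitespace.
cleaned = re.sub(r'(?:_|[^\w\s])+', '', sample)

print(glued)    # BarackObamavisitedParisFrance
print(cleaned)  # Barack Obama visited Paris France
```

The first result is a single token, which is why the entity loop prints nothing; the second still has word boundaries for the tokenizer to work with.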