NER - Entity Recognition - Country Filter


I want to extract geo-relevant information from an Excel file with spaCy. Extracting all entities works, but I only need the geo data and can't find a way to filter the entities.

import pandas as pd
import spacy

sp = spacy.load("en_core_web_sm")
df = pd.read_excel("test.xlsx", usecols=["Bio", "Author"])
df.head(1)
df=df.fillna('')
#df['Bio']
doc = df.values.tolist()
#print (doc)
#sp(', '.join(doc[0])).ents
for entry in doc:
    #print('Current entry\n {}'.format(entry))
    for entity in sp(', '.join(entry)).ents:
        print(entity.text, entity.label)

Currently, the output looks like:

Munich 384

Germany 384

Venezuela 384

London 384

Portrait | 9191306739292312949

๐Ÿ“ โ„๐• ๐•Ÿ๐•˜ ๐•‚๐• ๐•Ÿ๐•˜ โ€‹ 383

๐Ÿ‡ฐ ๐ŸŒ ๐•‹๐•ฃ๐•’๐•ง๐•–๐•๐•๐•–๐•ฃโ€‹ 394

Visited:๐Ÿ‡ฌ๐Ÿ‡ง๐Ÿ‡ฌ 383

๐Ÿ‡ธ 384

๐Ÿ‡น 392

In the end, I want to write the geo-relevant entities (if any) back to the user's row in a new column "Location" in the CSV.

I would appreciate your help very much, with kind regards


Solution

  • As mentioned, you can filter for the "LOC" or "GPE" entity labels provided by the spaCy language model. However, be aware that the NER model needs sentence context to predict location entities reliably.

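    import spacy

    sp = spacy.load("en_core_web_sm")
    # loop over every row in the 'Bio' column (df is the DataFrame from the question)
    for text in df['Bio'].tolist():
        # use spaCy to extract the entities
        doc = sp(text)
        for ent in doc.ents:
            # keep only entities labelled 'LOC' or 'GPE'; label_ is the string name,
            # whereas label (no underscore) is the integer ID printed in the question
            if ent.label_ in ['LOC', 'GPE']:
                print(ent.text, ent.label_)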
    

    Here is the link to the spaCy NER documentation: https://spacy.io/usage/linguistic-features#named-entities
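
The question also asks for the matches to end up in a new "Location" column rather than being printed. A minimal sketch of one way to do that is below. It is not part of the original answer: the helper names `geo_entities` and `locations_for_texts`, the `", "` separator, and the output file name in the comments are illustrative assumptions. The only spaCy features relied on are `nlp.pipe`, `doc.ents`, `ent.text` and `ent.label_`.

```python
GEO_LABELS = {"GPE", "LOC"}  # countries/cities/states, and other named locations


def geo_entities(doc, labels=GEO_LABELS):
    """Return the text of each entity in a spaCy Doc whose label is in `labels`."""
    return [ent.text for ent in doc.ents if ent.label_ in labels]


def locations_for_texts(texts, nlp, sep=", "):
    """Run `nlp` over all texts and return one joined location string per text.

    Texts without any matching entity yield an empty string.
    """
    # nlp.pipe processes the texts as a stream and yields Docs in input order
    return [sep.join(geo_entities(doc)) for doc in nlp.pipe(texts)]


# Usage with the objects from the question (not executed here):
#   bios = df["Bio"].fillna("").astype(str).tolist()
#   df["Location"] = locations_for_texts(bios, sp)
#   df.to_csv("test_with_locations.csv", index=False)  # hypothetical file name
```

Rows where the model finds no "GPE" or "LOC" entity simply get an empty string, which matches the "if existing" requirement in the question.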

    EDIT

    Here is the full list of English spaCy entity types from the documentation:

    • PERSON People, including fictional.
    • NORP Nationalities or religious or political groups.
    • FAC Buildings, airports, highways, bridges, etc.
    • ORG Companies, agencies, institutions, etc.
    • GPE Countries, cities, states.
    • LOC Non-GPE locations, mountain ranges, bodies of water.
    • PRODUCT Objects, vehicles, foods, etc. (Not services.)
    • EVENT Named hurricanes, battles, wars, sports events, etc.
    • WORK_OF_ART Titles of books, songs, etc.
    • LAW Named documents made into laws.
    • LANGUAGE Any named language.
    • DATE Absolute or relative dates or periods.
    • TIME Times smaller than a day.
    • PERCENT Percentage, including "%".
    • MONEY Monetary values, including unit.
    • QUANTITY Measurements, as of weight or distance.
    • ORDINAL "first", "second", etc.
    • CARDINAL Numerals that do not fall under another type.

    Source: https://spacy.io/api/annotation#named-entities