I want to extract geo-relevant information from an Excel file with spaCy. Extracting all entities works, but I only need the geo data and can't find a way to filter the entities.
import pandas as pd
import spacy
sp = spacy.load("en_core_web_sm")
df = pd.read_excel("test.xlsx", usecols=["Bio", "Author"])
df.head(1)
df=df.fillna('')
#df['Bio']
doc = df.values.tolist()
#print (doc)
#sp(', '.join(doc[0])).ents
for entry in doc:
    #print('Current entry\n {}'.format(entry))
    for entity in sp(', '.join(entry)).ents:
        print(entity.text, entity.label)
Currently, the output looks like:
Munich 384
Germany 384
Venezuela 384
London 384
Portrait | 9191306739292312949
π βπ ππ ππ ππ β 383
π° π ππ£ππ§πππππ£β 394
Visited:π¬π§π¬ 383
πΈ 384
πΉ 392
In the end, I want to write the geo-relevant entities (if any exist) back to the user's row in a new column "Location" in the csv.
I would appreciate your help very much, with kind regards
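As mentioned, you can filter for the "LOC" or "GPE" entity labels provided by the spaCy language model. Note that `ent.label_` (with a trailing underscore) gives the label as a string, whereas `ent.label` gives the integer ID you are currently printing. Also be aware that the NER model needs sentence context to be able to predict location entities.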
sp = spacy.load("en_core_web_sm")
# loop over every row in the 'Bio' column
for text in df['Bio'].tolist():
    # use spacy to extract the entities
    doc = sp(text)
    for ent in doc.ents:
        # keep only entities labelled 'LOC' or 'GPE'
        if ent.label_ in ['LOC', 'GPE']:
            print(ent.text, ent.label_)
Here the link to the spacy NER documentation: https://spacy.io/usage/linguistic-features#named-entities
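To get the locations back into a new "Location" column, something along these lines might work. This is only a minimal sketch: the helper name `extract_locations`, the `GEO_LABELS` constant, and the choice to join several locations with ", " are my own assumptions, not part of spaCy or pandas. The helper only relies on `doc.ents`, `ent.text` and `ent.label_`, which are the same attributes used above.

```python
from typing import Callable, Iterable, List

# Labels treated as "geo" here (an assumption; extend it if you need more types).
GEO_LABELS = ("GPE", "LOC")


def extract_locations(nlp: Callable, texts: Iterable[str],
                      labels=GEO_LABELS, sep: str = ", ") -> List[str]:
    """Return one string per input text with its location entities joined by `sep`.

    `nlp` is expected to be a loaded spaCy pipeline (e.g. `sp` from above).
    Texts without any matching entity yield an empty string.
    """
    locations = []
    for text in texts:
        doc = nlp(text)
        found = []
        for ent in doc.ents:
            # keep only geo labels and skip duplicates within the same text
            if ent.label_ in labels and ent.text not in found:
                found.append(ent.text)
        locations.append(sep.join(found))
    return locations
```

With the objects from the question, usage would be roughly `df["Location"] = extract_locations(sp, df["Bio"])` followed by `df.to_csv("output.csv", index=False)`, where `output.csv` is just a placeholder filename. The result has one entry per row, in the same order, so it lines up with the DataFrame.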
EDIT
Here is the full list of English spacy entity types from the documentation: