I need to extract all the locations from a set of strings.
This is my code:
import nltk
from nltk.tokenize import word_tokenize
import spacy

nltk.download('punkt')
nlp = spacy.load('it_core_news_sm')


def extract_destinations(sentence):
    destinations = []
    exclusions = []
    search_type = "Viaggio"
    sentence = sentence.capitalize()
    # Use spaCy for the grammatical analysis
    doc = nlp(sentence)
    # Extract the named entities
    for entity in doc.ents:
        if entity.label_ == 'LOC':
            destinations.append(entity.text)
    # Work out which type of search to run
    for single_doc in doc:
        if single_doc.lemma_ == 'volo' or single_doc.lemma_ == 'volare':
            search_type = 'Volo'
    return destinations, search_type


# Example travel sentences
sentences = [
    "vorrei visitare la toscana ma non voglio vedere pisa",
    "Voglio fare un viaggio in europa, non italia e spagna",
    "Vorrei vistare il lazio senza passare da roma",
    "voglio volare in Perù",  # Does not work
]

# Process the sentences
for sentence in sentences:
    destinations, search_type = extract_destinations(sentence)
    print("Frase: {}".format(sentence))
    print("Destinazioni: {}".format(destinations))
    print("Tipo di ricerca: {}\n".format(search_type))
In this code I need to decide whether a destination should be added to destinations. I have about 20 sentences, but I have included only a few of them in the code to demonstrate the problem.
The issue is with the last sentence, "voglio visitare il Perù" (I want to visit Peru), which for some reason is not processed correctly. The problem seems to be the word voglio: in this case it is handled differently from the other sentences, even though, as you can see, I use it in those too.
If I run the code I get:
Frase: vorrei visitare la toscana ma non voglio vedere pisa
Destinazioni: ['toscana', 'pisa']
Tipo di ricerca: Viaggio
Frase: Voglio fare un viaggio in europa, non italia e spagna
Destinazioni: ['europa', 'italia', 'spagna']
Tipo di ricerca: Viaggio
Frase: Vorrei vistare il lazio senza passare da roma
Destinazioni: ['lazio', 'roma']
Tipo di ricerca: Viaggio
Frase: voglio visitare il Perù
Destinazioni: []
Tipo di ricerca: Volo
In the last entry I need Perù to be added to Destinazioni. If I just remove the word voglio, the code works and Perù is correctly added.
I don't understand what the difference is between the last sentence and the others.
Did I miss something?
You might try removing sentence = sentence.capitalize(). That call lower-cases the whole sentence except for the first character, so Perù becomes perù. I'm not sure exactly which architecture it_core_news_sm uses, but most NER models rely heavily on capitalization to identify named entities.
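To make the effect of capitalize() concrete, here is a small check that uses only the standard library and one sentence from the question's list. It shows only what happens to the string before it reaches the model; how much that changes the NER result is something you would have to confirm by running the pipeline.

```python
# Sentence taken from the question's test list
sentence = "voglio volare in Perù"

# str.capitalize() upper-cases the first character and lower-cases all the rest
capitalized = sentence.capitalize()

print(capitalized)  # Voglio volare in perù

# The place name has lost its capital letter after the call
print("Perù" in sentence, "Perù" in capitalized)  # True False
```

The other sentences in the list are already lower case apart from the first letter, so capitalize() leaves their place names exactly as they were typed.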
I've also found that NER models trained on WikiNER (which the spaCy Italian model is) often don't perform all that well. You can try the larger Italian model (it_core_news_lg) and see if that helps. I usually find noticeable improvements in NER with the larger models, but of course they come at some resource cost. If that is a concern, you can disable the pipeline components you don't need, e.g. disable=["tagger", "parser"].
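Putting both suggestions together might look roughly like the sketch below. The helper names load_pipeline, extract_locations and demo are made up for illustration, and it assumes it_core_news_lg has been installed separately. The lemma-based check for the search type from the question is left out here; if you keep it, verify that lemmas still come out as expected with the components you disable.

```python
def load_pipeline(model_name="it_core_news_lg"):
    """Load a spaCy pipeline, skipping components this sketch does not use."""
    # Imported here so the helper below can be read and tested without spaCy
    import spacy
    return spacy.load(model_name, disable=["tagger", "parser"])


def extract_locations(nlp, sentence):
    """Return the text of every entity labelled LOC in the sentence."""
    # The sentence is passed through unchanged: no .capitalize() call
    doc = nlp(sentence)
    return [ent.text for ent in doc.ents if ent.label_ == "LOC"]


def demo():
    """Run the sketch on two sentences from the question (needs the model)."""
    nlp = load_pipeline()
    for sentence in [
        "vorrei visitare la toscana ma non voglio vedere pisa",
        "voglio volare in Perù",
    ]:
        print("{} -> {}".format(sentence, extract_locations(nlp, sentence)))
```

Calling demo() loads the model once and reuses it for every sentence. Note that place names typed entirely in lower case, such as toscana, may still be missed, since the model gets no capitalization cue for them either.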