I am not sure I understand exactly how spaCy identifies named entities in a text, and in my case especially dates.
I am trying to extract each education entry and its respective dates from a text document. I have something like this:
text = '''University of A 2019 - 2020
University of B 2016 - 2019
College A 2013 - 2016
College B 2008 - 2013'''
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
This gives me the following output:
University of A ORG
University of B ORG
2016 - 2019 DATE
2013 - 2016 DATE
2008 - 2013 DATE
As expected, the universities are recognized as organizations, and I expected spaCy not to recognize the colleges, since they are less obvious than the university names. However, I do not understand why the first date (2019 - 2020) was missed while all the others were found.
I tried another text that looked something like this:
1997 : any text
1998 : any text
1999 : any text
...
2018 : any text
Here all dates were recognized except 2013 and 2018, although the format of those lines is the same as all the others.
Is there a way to train spaCy to recognize dates better, or should I use another tool? I'm already using spaCy for other parts of the same program. I'm not using regex right now because the dates can appear in so many different formats (year only, start year - end year, sometimes months and days too, etc.).
You need a more feature-rich model type: one with the _md or _lg suffix with spaCy 2.x, or the _trf suffix with spaCy 3.x.
For example, you may install
python -m spacy download en_core_web_trf
Then, you may use
import spacy
nlp = spacy.load('en_core_web_trf')
text = '''University of A 2019 - 2020
University of B 2016 - 2019
College A 2013 - 2016
College B 2008 - 2013'''
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
Output:
2019 - 2020 DATE
2016 - 2019 DATE
2013 - 2016 DATE
2008 - 2013 DATE
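If a statistical model still misses an occasional year, one possible safety net is a small rule-based pass for the two simplest formats mentioned in the question (a single year, or start year - end year). The following is only a minimal sketch using Python's standard `re` module: the names `YEAR_RANGE` and `find_year_spans` are made up for illustration, the 1900-2099 range is an assumption, and formats with months or days are not handled, so it would complement spaCy's DATE entities rather than replace them.

```python
import re

# Assumption: years of interest fall between 1900 and 2099.
# Matches a four-digit year, optionally followed by "- year" (hyphen or en dash).
YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})\b(?:\s*[-\u2013]\s*((?:19|20)\d{2})\b)?"
)


def find_year_spans(text):
    """Return (start_year, end_year) tuples; end_year is None for a single year."""
    spans = []
    for match in YEAR_RANGE.finditer(text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else None
        spans.append((start, end))
    return spans


text = '''University of A 2019 - 2020
University of B 2016 - 2019
College A 2013 - 2016
College B 2008 - 2013'''

# Print each line together with the year spans found in it.
for line in text.splitlines():
    print(line, find_year_spans(line))
```

The spans found this way could then be compared against the DATE entities in `doc.ents` to fill in only the lines where the model found nothing.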