spaCy: named entity recognition on dates not working as expected


I am not sure I understand exactly how spaCy identifies named entities in a text, especially dates.

I am trying to extract each education entry and its corresponding dates from a text document. I have something like this:

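# nlp is an already-loaded spaCy pipeline (the model is not specified here)
# A multi-line string literal needs triple quotes
text = '''University of A  2019 - 2020
        University of B  2016 - 2019
        College A        2013 - 2016
        College B        2008 - 2013'''
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)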

This gives the following output:

University of A  ORG
University of B  ORG
2016 - 2019      DATE
2013 - 2016      DATE
2008 - 2013      DATE

As expected, the universities are recognized as organizations, and I did not expect spaCy to recognize the colleges, since those names are less obvious than the university names. However, I do not understand why the first date range (2019 - 2020) is missing while all the others are recognized.

I tried another text that looked something like this:

1997 : any text
1998 : any text
1999 : any text
...
2018 : any text

Here all dates were recognized except 2013 and 2018, although the format of those lines is the same as all the others.

Is there a way to train spaCy to recognize dates better, or should I use another tool? I'm already using spaCy for other parts of the same program. I'm not using regex right now because the dates can come in so many different formats (year only, start year - end year, sometimes months and days too, etc.).


Solution

  • You need a more feature-rich model: one with the _md or _lg suffix in spaCy 2.x, or the transformer-based _trf model in spaCy 3.x.

    For example, you may install

    python -m spacy download en_core_web_trf
    

    Then, you may use

    import spacy
    nlp = spacy.load('en_core_web_trf')
    text = '''University of A  2019 - 2020
             University of B  2016 - 2019
             College A        2013 - 2016
             College B        2008 - 2013'''
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    

    Output:

    2019 - 2020 DATE
    2016 - 2019 DATE
    2013 - 2016 DATE
    2008 - 2013 DATE
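
If a larger model still misses a plain year or year range, a small rule-based pass can complement it. The sketch below is not part of the original answer. `YEAR_RANGE` and `extract_year_spans` are hypothetical names, and the pattern assumes four-digit years between 1900 and 2099, optionally joined by a hyphen or en dash. It does not cover months or days, so it is a fallback for the layouts shown in the question rather than a replacement for the NER model.

```python
import re

# Hypothetical pattern: a four-digit year (1900-2099), optionally followed
# by a dash and a second year, e.g. "2019 - 2020" or a lone "1997".
YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}))?\b"
)


def extract_year_spans(text):
    """Return (start_year, end_year) tuples; end_year is None for a lone year."""
    spans = []
    for match in YEAR_RANGE.finditer(text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else None
        spans.append((start, end))
    return spans


text = '''University of A  2019 - 2020
University of B  2016 - 2019
College A        2013 - 2016
College B        2008 - 2013'''

print(extract_year_spans(text))
```

One possible use is to run this only on lines where `doc.ents` contains no `DATE` entity, so the model's richer date handling still takes priority wherever it succeeds.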