Tags: python, regex, nlp, spacy, named-entity-recognition

Extracting dates from a sentence in spaCy


I have a string like so:

"The dates are from 30 June 2019 to 1 January 2022 inclusive"

I want to extract the dates from this string using spaCy.

Here is my function so far:

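import spacy

# Assumption: an English pipeline with an NER component, e.g. en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        # Keep only the entities that spaCy labels as dates
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year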

This returns the following output:

['30 June 2019 to 1 January 2022']

However, I want output like:

['30 June 2019', '1 January 2022']

Solution

  • The issue is that spaCy treats "to" as part of the date, so "30 June 2019 to 1 January 2022" is tagged as a single DATE entity. Your loop over doc.ents therefore only sees that one entity, and you get one string back instead of two.

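    As you don't want this behaviour, you can amend your function to split each DATE entity on the word "to". Split on "to" surrounded by whitespace rather than on the bare substring, because ent.text.split("to") would also cut "October" into "Oc" and "ber":

    import re

    def extract_dates_with_year(text):
        doc = nlp(text)
        dates_with_year = []
        for ent in doc.ents:
            if ent.label_ == "DATE":
                # Split only where "to" is a separate word, not inside "October"
                for ent_txt in re.split(r"\s+to\s+", ent.text):
                    dates_with_year.append(ent_txt.strip())
        return dates_with_year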
    

    This handles ranges like the one in your example, as well as single dates and text containing several dates:

    txt = """
         The dates are from 30 June 2019 to 1 January 2022 inclusive.
         And oddly also 5 January 2024.
         And exclude 21 July 2019 until 23 July 2019.
    """
    
    extract_dates_with_year(txt)
    
    # Output:
    [
     '30 June 2019',
     '1 January 2022',
     '5 January 2024',
     '21 July 2019',
     '23 July 2019'
    ]
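
If the model ever tags a range joined by a different word, such as "until", as a single DATE entity, splitting on "to" alone would leave it intact. Below is a minimal sketch of a more general splitter that uses only the standard library. The names RANGE_SEPARATORS and split_date_range are hypothetical, and the list of connectors is an assumption you would adjust to your own data.

```python
import re

# Hypothetical list of range connectors; extend it to match your own data.
RANGE_SEPARATORS = ("to", "until", "through")

# Match a connector only when it stands alone between whitespace, so that
# the "to" inside "October" is never treated as a separator.
_SEPARATOR_RE = re.compile(
    r"\s+(?:" + "|".join(re.escape(word) for word in RANGE_SEPARATORS) + r")\s+",
    flags=re.IGNORECASE,
)


def split_date_range(entity_text):
    """Split the text of one DATE entity into its individual dates."""
    parts = _SEPARATOR_RE.split(entity_text)
    # Drop empty pieces and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]
```

Inside the loop over doc.ents you could then call dates_with_year.extend(split_date_range(ent.text)) instead of splitting on "to" directly.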