python date datetime nltk python-dateutil

How to extract date and time-period information from raw sentences in Python


Input:

  1. Valid for ticketing and travelling Starting from Mar 27 2016 to Dec 31 2016
  2. Effective Period Tickets must be issued on before 18 FEB 16
  3. Effective Period Ticket must be issued on before 29 FEB 2016
  4. TRAVELING DATES NOW - FEB 10 2016 FEB 22 2016 - MAY 12 2016
  5. Ticketing Effective Period on before 31 Jan 2016

(Note: The input has already been preprocessed to this stage by some Python code, so that it is easier to handle with Python packages.)

Expected output:

  1. from 2016-03-27 to 2016-12-31
  2. on before 2016-02-18
  3. on before 2016-02-29
  4. now - 2016-02-10 2016-02-22 - 2016-05-12
  5. on before 2016-01-31

I have tried dateutil, but as far as I can tell it extracts only one date per string. Even in the single-date case, extracting both the preposition and the date is still a problem.

I also looked at dateparser and datefinder. It seems they both use dateutil.

Dates can be in any format (YYYY-MM-DD, DDMMYYYY, etc.), as long as they all use the same one.

The output doesn't have to be identical to the above, as long as it conveys the same information accurately.

Finally, thanks for your time and thoughts. I will also keep trying.


Solution

  • After a few days of research, I came up with the following approaches, which solve the extraction problem.

    1. Recognize the prepositions, then recognize the months, and do the extraction.
    2. Recognize '-', then recognize the months, and do the extraction.

    Part of the code is shown below. (It is an excerpt that depends on names defined elsewhere in the full script.)

    # Excerpt: new_s, prepositions, months, label_class, i and released_date
    # are defined elsewhere in the full script.
    new_w = new_s.split()
    n = len(new_w)
    for j in range(n):
        # A preposition followed by a day number or a month name starts a date phrase.
        # The j + 1 < n check avoids an IndexError on the last word.
        if (new_w[j] in prepositions and j + 1 < n
                and (new_w[j+1].isdecimal() or new_w[j+1].lower() in months)):
            # Process case like "Starting from Mar27, 2016 to Dec31, 2016"
            if j + 7 < n and new_w[j+4] in prepositions:
                if new_w[j+5].isdecimal() or new_w[j+5].lower() in months:
                    u = ' '.join(new_w[j:j+8])
                    print(label_class[i] + ': ' + u)
                    break
            # Process case like "Ticket must be issued on/before 29FEB, 2016"
            # (j >= 1 stops new_w[j-1] from wrapping around to the last word)
            elif j >= 1 and new_w[j-1] in prepositions:
                u = ' '.join(new_w[j-1:j+4])
                print(label_class[i] + ': ' + u)
                break
            # Process case like "Ticketing valid until 18FEB16"
            else:
                u = ' '.join(new_w[j:j+4])
                print(label_class[i] + ': ' + u)
                break
        # Process case like "TICKETING PERIOD:      NOW - FEB 02, 2016"
        # Process case like "TRAVELING DATES:      NOW - FEB 10,2016    FEB 22,2016 - MAY 12,2016"
        # Slices never raise IndexError, so they replace the direct j+1 / j+2 lookups.
        if new_w[j] == '-' and any(w.lower() in months for w in new_w[j+1:j+3]):
            if j >= 1 and new_w[j-1].lower() == 'now':
                u = released_date + ' - ' + ' '.join(new_w[j+1:j+4])
                print(label_class[i] + ': ' + u)
            # A month one or two words before the '-' means a "date - date" range.
            elif any(w.lower() in months for w in new_w[max(j-3, 0):max(j-1, 0)]):
                u = ' '.join(new_w[max(j-3, 0):j+4])
                print(label_class[i] + ': ' + u)
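
Because the excerpt relies on names defined elsewhere (prepositions, months, label_class, released_date), here is a minimal self-contained sketch of the same idea using only the standard library. It is not the original script: it swaps the fixed word offsets for a regular expression, and the names MONTHS, CONNECTORS, DATE_RE and extract_period are made up for illustration. It assumes three-letter English month abbreviations, assumes two-digit years belong to the 2000s, and leaves "now" as a word instead of substituting a release date.

```python
import re
from datetime import date
from itertools import groupby

MONTHS = {m: i for i, m in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}
# Hypothetical connector list; extend it to match the real data.
CONNECTORS = {"from", "to", "on", "before", "until", "now", "-"}

_MON = "|".join(MONTHS)
# Matches "27 Mar 2016", "Mar 27 2016" and two-digit years such as "18 FEB 16".
DATE_RE = re.compile(
    rf"\b(?:(?P<d1>\d{{1,2}})\s+(?P<m1>{_MON})|(?P<m2>{_MON})\s+(?P<d2>\d{{1,2}}))"
    rf"\s+(?P<y>\d{{4}}|\d{{2}})\b",
    re.IGNORECASE,
)
ISO_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _to_iso(match):
    """Turn one DATE_RE match into YYYY-MM-DD, or leave it alone if invalid."""
    day = int(match["d1"] or match["d2"])
    month = MONTHS[(match["m1"] or match["m2"]).lower()]
    year = int(match["y"])
    if year < 100:  # assumption: two-digit years belong to the 2000s
        year += 2000
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. "30 Feb 2016": keep the original text
        return match.group(0)


def _is_date(token):
    return ISO_RE.fullmatch(token) is not None


def _is_kept(token):
    return _is_date(token) or token.lower() in CONNECTORS


def extract_period(sentence):
    """Return the dates of *sentence* in ISO form with their connecting words."""
    # Step 1: collapse every recognised date into a single ISO token.
    tokens = DATE_RE.sub(_to_iso, sentence).split()
    parts = []
    # Step 2: group consecutive date/connector tokens into runs.
    for kept, run in groupby(tokens, key=_is_kept):
        run = list(run)
        # Step 3: keep a run only if it contains at least one date, so a
        # stray "on" or "to" elsewhere in the sentence is dropped.
        if kept and any(_is_date(t) for t in run):
            parts.extend(t if _is_date(t) else t.lower() for t in run)
    return " ".join(parts)


if __name__ == "__main__":
    print(extract_period(
        "Valid for ticketing and travelling Starting from Mar 27 2016 to Dec 31 2016"))
```

On the five sample inputs from the question, extract_period returns the strings listed under "Expected output" (for example "from 2016-03-27 to 2016-12-31" for the first one). Full month names, punctuation attached to the tokens, and other date layouts would need a wider pattern.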