
Pythonic way to filter data with overlapping dates


I have a data structure in which each team has a list of issues with start and end dates.

For each team, I would like to merge issues that have the same key and overlapping dates. In the merged issue, the start date should be the earlier of the start dates and the end date should be the later of the end dates.

I am trying to do this with a few for loops, but I was wondering what the most Pythonic way to do it would be.

Update

I want to merge only issues that have the same key, belong to the same team, and have overlapping dates.

Issues are not in chronological order.

Input:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '10/Oct/13 5:16 PM',
        'issue': 'KEY-12678'
    }, {
        'start': '3/Oct/13 10:40 AM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],

    'Team B': [{
        'start': '5/Sep/13 3:35 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '19/Aug/13 5:05 PM',
        'end': '10/Sep/13 5:16 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

Output:

{
    'Team A': [{
        'start': '11/Jul/13 1:49 PM',
        'end': '11/Nov/13 1:02 PM',
        'issue': 'KEY-12678'
    }],
    'Team B': [{
        'start': '19/Aug/13 5:05 PM',
        'end': '08/Nov/13 3:35 PM',
        'issue': 'KEY-12679'
    }, {
        'start': '09/Jul/13 9:15 AM',
        'end': '29/Jul/13 9:15 AM',
        'issue': 'KEY-12680'
    }]
}

To parse the dates, here is the date format (to save you a couple of minutes):

date_format = "%d/%b/%y %I:%M %p"
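
The hour directive matters for these timestamps: `%p` only shifts the hour when it is paired with the 12-hour directive `%I`. With `%H`, `strptime` accepts the AM/PM marker but does not apply it. A minimal sketch of the difference, assuming an English or C locale for the month abbreviations (the variable names are made up for this illustration):

```python
from datetime import datetime

format_24h = "%d/%b/%y %H:%M %p"  # %H: the AM/PM marker is parsed but ignored
format_12h = "%d/%b/%y %I:%M %p"  # %I: the AM/PM marker shifts the hour

stamp = "10/Oct/13 5:16 PM"
parsed_24h = datetime.strptime(stamp, format_24h)
parsed_12h = datetime.strptime(stamp, format_12h)

print(parsed_24h)  # 2013-10-10 05:16:00
print(parsed_12h)  # 2013-10-10 17:16:00
```

Parsing with `%H` would turn every afternoon timestamp into a morning one, which can change both the sort order and the overlap checks.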

Update, new test data

Input

d = {
"N/A": [
  {'start': '23/Jun/14 8:48 PM', 'end': '01/Aug/14 11:00 PM', 'issue': 'KEY-12157'}
  ,{'start': '09/Jul/13 1:57 PM',  'end': '29/Jul/13 1:57 PM', 'issue': 'KEY-12173'}
  ,{'start': '21/Aug/13 12:29 PM', 'end': '02/Dec/13 6:06 PM', 'issue': 'KEY-12173'}
  ,{'start': '17/Feb/14 3:17 PM', 'end': '18/Feb/14 5:51 PM', 'issue': 'KEY-12173'}
  ,{'start': '12/May/14 4:42 PM', 'end': '02/Jun/14 4:42 PM', 'issue': 'KEY-12173'}
  ,{'start': '24/Jun/14 11:33 AM',  'end': '01/Aug/14 11:49 AM', 'issue': 'KEY-12173'}
  ,{'start': '07/Oct/14 1:17 PM',  'end': '17/Nov/14 10:30 AM', 'issue': 'KEY-12173'}
  ,{'start': '31/Mar/15 1:58 PM', 'end': '12/May/15 4:26 PM', 'issue': 'KEY-12173'}
  ,{'start': '15/Jul/14 10:06 AM',  'end': '15/Sep/14 5:25 PM', 'issue': 'KEY-12173'}
  ,{'start': '06/Jan/15 10:46 AM',  'end': '26/Jan/15 10:46 AM', 'issue': 'KEY-20628'}
  ,{'start': '18/Nov/14 5:08 PM',  'end': '16/Feb/15 1:31 PM', 'issue': 'KEY-20628'}
  ,{'start': '02/Oct/13 12:32 PM', 'end': '21/Oct/13 5:32 PM', 'issue': 'KEY-12146'}
  ,{'start': '11/Mar/14 12:08 PM', 'end': '31/Mar/14 12:08 PM', 'issue': 'KEY-12681'}
  ]}

Output

{'start': '18/Nov/14 05:08 PM', 'issue': 'KEY-20628', 'end': '16/Feb/15 01:31 PM'}
{'start': '09/Jul/13 1:57 PM', 'issue': 'KEY-12173', 'end': '29/Jul/13 1:57 PM'}
{'start': '21/Aug/13 12:29 PM', 'issue': 'KEY-12173', 'end': '02/Dec/13 6:06 PM'}
{'start': '17/Feb/14 3:17 PM', 'issue': 'KEY-12173', 'end': '18/Feb/14 5:51 PM'}
{'start': '12/May/14 4:42 PM', 'issue': 'KEY-12173', 'end': '02/Jun/14 4:42 PM'}
{'start': '24/Jun/14 11:33 AM', 'issue': 'KEY-12173', 'end': '15/Sep/14 05:25 PM'}
{'start': '07/Oct/14 1:17 PM', 'issue': 'KEY-12173', 'end': '17/Nov/14 10:30 AM'}
{'start': '31/Mar/15 1:58 PM', 'issue': 'KEY-12173', 'end': '12/May/15 4:26 PM'}
{'start': '11/Mar/14 12:08 PM', 'issue': 'KEY-12681', 'end': '31/Mar/14 12:08 PM'}
{'start': '23/Jun/14 8:48 PM', 'issue': 'KEY-12157', 'end': '01/Aug/14 11:00 PM'}
{'start': '02/Oct/13 12:32 PM', 'issue': 'KEY-12146', 'end': '21/Oct/13 5:32 PM'}

Solution

  • Here is a pandas solution, as hinted by aquavitae in a comment. It consists of the following steps:

    • Read the data from the dictionary d that you provided into a DataFrame.
    • Convert the start and end columns into datetime objects.
    • Sort the data by issue key and by start date, and reset the index.
    • Compare the end time of each row with the start time of the next row, and check whether the two rows have the same key.
    • Collect the indices of the rows that overlap with the row before them.
    • Loop over these to-be-dropped rows (inefficient, but I couldn't come up with anything better so far) and merge their data into the row they overlap with.
    • Drop these rows.
    • Convert back to the dictionary format.

    This looks like:

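    import pandas as pd

    # %I (12-hour clock) is needed for %p to take effect
    date_format = '%d/%b/%y %I:%M %p'
    df = pd.DataFrame(d['N/A'])
    df['start'] = pd.to_datetime(df['start'], format=date_format)
    df['end'] = pd.to_datetime(df['end'], format=date_format)
    # DataFrame.sort() was removed from pandas; sort_values() replaces it
    df = df.sort_values(['issue', 'start']).reset_index(drop=True)
    # Compare each row with the next one; .values sidesteps index alignment
    time_overlaps = df['end'].values[:-1] > df['start'].values[1:]
    same_issue = df['issue'].values[:-1] == df['issue'].values[1:]
    rows_to_drop = time_overlaps & same_issue
    rows_to_drop_indices = [i + 1 for i, j in enumerate(rows_to_drop) if j]
    # Go backwards so that a chain of overlapping rows collapses into its first row;
    # max() keeps the later end when one issue lies entirely inside another
    for i in reversed(rows_to_drop_indices):
        df.loc[i - 1, 'end'] = max(df.loc[i - 1, 'end'], df.loc[i, 'end'])
    df.drop(rows_to_drop_indices, inplace=True)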
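
Comparing only adjacent rows can miss an overlap when a long issue is followed by a short one that ends early and then by a third one that still starts inside the long one. One possible way around this, and around the explicit loop, is to compare each start against the running maximum of all earlier end dates within the same issue. The following is a sketch under that assumption, not the method used above; `merge_overlaps` and `DATE_FORMAT` are names made up for this illustration.

```python
import pandas as pd

DATE_FORMAT = "%d/%b/%y %I:%M %p"  # assumed input format (12-hour clock)


def merge_overlaps(records):
    """Merge records that share an issue key and overlap in time.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    df = pd.DataFrame(records)
    df["start"] = pd.to_datetime(df["start"], format=DATE_FORMAT)
    df["end"] = pd.to_datetime(df["end"], format=DATE_FORMAT)
    df = df.sort_values(["issue", "start"]).reset_index(drop=True)

    # Latest end seen among the earlier rows of the same issue
    # (NaT for the first row of every issue).
    prev_max_end = df.groupby("issue")["end"].transform(
        lambda s: s.cummax().shift()
    )

    # A row opens a new interval unless it starts before that running maximum.
    # Comparisons with NaT are False, so first rows always open an interval.
    new_interval = ~(df["start"] < prev_max_end)
    interval_id = new_interval.cumsum()

    return (
        df.groupby(interval_id)
        .agg(issue=("issue", "first"), start=("start", "min"), end=("end", "max"))
        .reset_index(drop=True)
    )


team_b = [
    {"start": "5/Sep/13 3:35 PM", "end": "08/Nov/13 3:35 PM", "issue": "KEY-12679"},
    {"start": "19/Aug/13 5:05 PM", "end": "10/Sep/13 5:16 PM", "issue": "KEY-12679"},
    {"start": "09/Jul/13 9:15 AM", "end": "29/Jul/13 9:15 AM", "issue": "KEY-12680"},
]
merged = merge_overlaps(team_b)
print(merged)
```

To process every team, the helper could be called once per list, for example `{team: merge_overlaps(issues) for team, issues in data.items()}`, where `data` stands for the dictionary from the question.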
    

    If you don't need to keep the DataFrame for further calculations and want the result back as a list of dictionaries like the one in your question, do:

    df.to_dict('records')
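
Note that `to_dict('records')` leaves `start` and `end` as Timestamp objects. If the string format from the question is needed, the columns can be formatted first. A small sketch, assuming the same 12-hour format for output; `to_records` and `OUTPUT_FORMAT` are hypothetical names:

```python
import pandas as pd

OUTPUT_FORMAT = "%d/%b/%y %I:%M %p"  # assumed output format


def to_records(df):
    """Return a list of dicts with start/end rendered as strings.

    Hypothetical helper for illustration; it works on a copy of df.
    """
    out = df.copy()
    for col in ("start", "end"):
        out[col] = out[col].dt.strftime(OUTPUT_FORMAT)
    return out.to_dict("records")


example = pd.DataFrame({
    "issue": ["KEY-12678"],
    "start": [pd.Timestamp(2013, 7, 11, 13, 49)],
    "end": [pd.Timestamp(2013, 11, 11, 13, 2)],
})
records = to_records(example)
print(records)
```

`strftime` zero-pads the hour, so `1:49 PM` comes back as `01:49 PM`.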
    

    EDIT: Found an efficient way to do it!