Search code examples
pythonpython-3.xlogparser

Merge overlapping operation in csv without Pandas


I have a csv file that contains data like that

Sample csv

Name Start End
John 12:00 13:00
John 12:10 13:00
John 12:20 13:20
Tom 12:00 13:10
John 13:50 14:00
Jerry 14:00 14:30
Alice 15:00 16:00
Jerry 11:00 15:00

I need to perform Merging operation on overlapping intervals

Before merge

  • John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
  • Jerry [14:00,14:35],[11:00,15:00]
  • Tom [15:00,16:00]
  • Alice [12:00,13:10]

After merge

  • John [12:00,13:20],[13:50,14:00]
  • Jerry [11:00,15:00]
  • Alice [12:00 ,13:10]
  • Tom [15:00,16:00]

with pandas its quite easily done. But i need to use it via python core libraries

I am tryring to do with csv_reader

import csv

dict = {}
with open('person.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = row["Start"]
        end_time = row["End"]
        dict[name] = [start_time,end_time]
        

but i am not sure how to compare the end and starting index to check overlapping condition


Solution

  • Sorting the data as a first step makes the computation easier. To avoid using index everywhere I chose to use a NamedTuple :

    from collections import namedtuple
    
    Interval = namedtuple("Interval", "name start end")
    
    l = [
        Interval("John",    "12:00",    "13:00"),
        Interval("John",    "12:10",    "13:00"),
        Interval("John",    "12:20",    "13:20"),
        Interval("Tom",     "12:00",    "13:10"),
        Interval("John",    "13:50",    "14:00"),
        Interval("Jerry",   "14:00",    "14:30"),
        Interval("Alice",   "15:00",    "16:00"),
        Interval("Jerry",   "11:00",    "15:00"),
    ]
    
    sorted_l = sorted(l)
    result = []
    
    for interval in sorted_l:
        # init
        if not result:
            result.append(interval)
            continue
    
        last_interval = result[-1]
        
        # Not the same person
        if last_interval.name != interval.name:
            result.append(interval)
    
        # No intersection between intervals
        elif interval.start > last_interval.end:
            result.append(interval)
    
        # Overlap between intervals
        elif last_interval.start <= interval.start <= last_interval.end:
            # Upper bound can be the end of either interval
            upper_bound = max(last_interval.end, interval.end)
            # We overwrite the last interval with the new upper bound
            result[-1] = Interval(last_interval.name, last_interval.start, upper_bound)
        else:
            raise ValueError("Case not handled :(")
    
    print(result)
    
    
    >[
        Interval(name='Alice', start='15:00', end='16:00'), 
        Interval(name='Jerry', start='11:00', end='15:00'), 
        Interval(name='John', start='12:00', end='13:20'), 
        Interval(name='John', start='13:50', end='14:00'), 
        Interval(name='Tom', start='12:00', end='13:10')
    ]