Search code examples
pythonpandasdata-cleaning

Need to aggregate data in pandas data frame


Below is the data set

DataSet

I want to merge the data set based on " UUID" to get one row with start and end time as first of start time and end of end time All other numeric row to be summed up


Solution

  • Dataframe:

    import pandas as pd
    
    df=pd.DataFrame({
            'trip_creation_time':['37:08.9', '37:08.9', '37:08.9'],
            'UUID':['ABC-1', 'ABC-1', 'ABC-1'],
            'RouteType':['C', 'C', 'C'],
            'tripid':['trip-15', 'trip-15', 'trip-15'],
            'sourcecenter':['AAC', 'AAC', 'AAC'],
            'source_name': ['H_TS_1', 'H_TS_2', 'H_TS_3'],
            'destination_center': ['AAC', 'AAC', 'AAA'],
            'destination_name': ['M-TS_1', 'M-TS_2', 'M-TS_3'],
            'starttime':['05:52.6', '05:52.6', '05:34.8'],
            'endtime':['05:34.8', '05:34.8', '35:54.0'],
            'endtoend_time': [59, 59, 210],
            'actualdistance': [9.684842238, 14.85920056, 9.499885115],
            'actualtime':[14, 30, 165],
            'otime':[8, 12, 14],
            'distance':[10.3638, 16.0049, 14.4597]
            
        })
    

    Group By UUID and aggregate the columns by passing the function you want to run on each column

    first of starttime and last of endtime and summing up column distance and otime

    df.groupby('UUID').agg(starttime_first=('starttime','first'),
                          endtime_last=('starttime','last'),
                          distance_sum = ('distance', 'sum'),
                          otime_sum = ('otime', 'sum'))
    

    enter image description here