Search code examples
pythonpython-3.xpandasgroup-byaggregate

Group by time intervals and additional attribute


I have this data:

enter image description here

import pandas as pd

data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35', '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}

df = pd.DataFrame(data)

I want to create two sets of CSVs:

  1. One CSV for each Type of vehicles (8 in total) where the rows will by grouped by / aggregated by time-stamp (for 15 minute intervals throughout the day) and by "FROM" column - there will be no "TO" column here.
  2. One CSV for each Type of vehicles (8 in total) where the rows will by grouped by / aggregated by time-stamp (for 15 minute intervals throughout the day), by "FROM" column and by "TO" column.

The difference between the two sets is that one will count all FROM items and the other will group them and count them by pairs of FROM and TO.

The output will be an aggregated sum of vehicles of a given type for 15 minute intervals summed up by FROM column and also a combination of FROM and TO column.

1st output can look like this for each vehicle type:

enter image description here

2nd output:

enter image description here

I tried using Pandas groupby() and resample() but due to my limited knowledge to no success. I can do this in Excel but very inefficiently. I want to learn Python more and be more efficient, therefore I would like to code it in Pandas.

I tried df.groupby(['FROM', 'TO']).count() but I lack the knowledge to usit for what I need. I keep either getting error when I do something I should not or the output is not what I need.

I tried df.groupby(pd.Grouper(freq='15Min', )).count() but it seems I perhaps have incorrect data type.

And I don't know if this is applicable.


Solution

  • If I understand you correctly, one approach could be as follows:

    Data

    import pandas as pd
    
    # IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, we need `to_offset`
    from pandas.tseries.frequencies import to_offset
    
    # adjusting last 2 timestamps to get a diff interval group
    data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', 
                          '2022-11-03 00:00:35', '2022-11-03 00:00:46', 
                          '2022-11-03 00:20:21', '2022-11-03 00:21:30'], 
            'from': ['A', 'A', 'A', 'A', 'B', 'C'],
            'to': ['B', 'B', 'B', 'C', 'C', 'B'],
            'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
    
    df = pd.DataFrame(data)
    
    print(df)
    
                 timestamp from to type
    0  2022-11-03 00:00:06    A  B  Car
    1  2022-11-03 00:00:33    A  B  Car
    2  2022-11-03 00:00:35    A  B  Van
    3  2022-11-03 00:00:46    A  C  Car
    4  2022-11-03 00:20:21    B  C  HGV
    5  2022-11-03 00:21:30    C  B  Van
    
    # e.g. for FROM we want:        `A`, `4` (COUNT), `00:15` (TIME-END)
    # e.g. for FROM-TO we want:     `A-B`, 3 (COUNT), `00:15` (TIME-END)
    #                               `A-C`, 1 (COUNT), `00:15` (TIME-END)
    

    Code

    # convert time strings to datetime and set column as index
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.set_index('timestamp', inplace=True)
    
    # add `15T (== mins) offset to datetime vals
    df.index = df.index + to_offset('15T')
    
    # create `dict` for conversion of `col names`
    cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}
    
    # we're doing basically the same for both outputs, so let's use a for loop on a nested list
    nested_list = [['from'],['from','to']]
    
    for item in nested_list:
        # groupby `item` (i.e. `['from']` and `['from','to']`)
        # use `.agg` to create named output (`COUNT`), applied to `item[0]`, so 2x  on: `from`
        # and get the `count`. Finally, reset the index
        out = df.groupby(item).resample('15T').agg(COUNT=(item[0],'count')).reset_index()
        
        # rename the columns using our `cols` dict
        out = out.rename(columns=cols)
        
        # convert timestamps like `'2022-11-03 00:15:00' to `00:15`
        out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
        
        # rearrange order of columns; for second `item` we need to include `to` (now: `TO`)
        if 'TO' in out.columns:
            out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
        else:
            out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
            
        # write output to `csv file`; e.g. use an `f-string` to customize file name
        out.to_csv(f'output_{"_".join(item)}.csv') # i.e. 'output_from', 'output_from_to'
        # `index=False` avoids writing away the index
    

    Output (loaded in excel)

    result loaded in excel

    Relevant documentation: