python python-3.x pandas group-by aggregate

Group by time intervals and additional attribute

I have this data:

import pandas as pd

data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35', '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}

df = pd.DataFrame(data)

I want to create two sets of CSVs:

One CSV for each Type of vehicles (8 in total) where the rows will by grouped by / aggregated by time-stamp (for 15 minute intervals throughout the day) and by "FROM" column - there will be no "TO" column here.
One CSV for each Type of vehicles (8 in total) where the rows will by grouped by / aggregated by time-stamp (for 15 minute intervals throughout the day), by "FROM" column and by "TO" column.

The difference between the two sets is that one will count all FROM items and the other will group them and count them by pairs of FROM and TO.

The output will be an aggregated sum of vehicles of a given type for 15 minute intervals summed up by FROM column and also a combination of FROM and TO column.

1st output can look like this for each vehicle type:

2nd output:

I tried using Pandas groupby() and resample() but due to my limited knowledge to no success. I can do this in Excel but very inefficiently. I want to learn Python more and be more efficient, therefore I would like to code it in Pandas.

I tried df.groupby(['FROM', 'TO']).count() but I lack the knowledge to usit for what I need. I keep either getting error when I do something I should not or the output is not what I need.

I tried df.groupby(pd.Grouper(freq='15Min', )).count() but it seems I perhaps have incorrect data type.

And I don't know if this is applicable.

Solution

If I understand you correctly, one approach could be as follows:

Data

import pandas as pd

# IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, we need `to_offset`
from pandas.tseries.frequencies import to_offset

# adjusting last 2 timestamps to get a diff interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', 
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46', 
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'], 
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}

df = pd.DataFrame(data)

print(df)

             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van

# e.g. for FROM we want:        `A`, `4` (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want:     `A-B`, 3 (COUNT), `00:15` (TIME-END)
#                               `A-C`, 1 (COUNT), `00:15` (TIME-END)

Code

# convert time strings to datetime and set column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# add `15T (== mins) offset to datetime vals
df.index = df.index + to_offset('15T')

# create `dict` for conversion of `col names`
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}

# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'],['from','to']]

for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from','to']`)
    # use `.agg` to create named output (`COUNT`), applied to `item[0]`, so 2x  on: `from`
    # and get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15T').agg(COUNT=(item[0],'count')).reset_index()
    
    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)
    
    # convert timestamps like `'2022-11-03 00:15:00' to `00:15`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
    
    # rearrange order of columns; for second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
        
    # write output to `csv file`; e.g. use an `f-string` to customize file name
    out.to_csv(f'output_{"_".join(item)}.csv') # i.e. 'output_from', 'output_from_to'
    # `index=False` avoids writing away the index

Output (loaded in excel)

Relevant documentation: