I have this data:
import pandas as pd
data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35', '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}
df = pd.DataFrame(data)
I want to create two sets of CSVs:
The difference between the two sets is that one will count all FROM items and the other will group them and count them by pairs of FROM and TO.
Each output should be a count of vehicles of a given type per 15-minute interval, grouped by the FROM column in the first set and by the combination of the FROM and TO columns in the second.
1st output can look like this for each vehicle type:
2nd output:
I tried using Pandas groupby() and resample(), but due to my limited knowledge I had no success. I can do this in Excel, but very inefficiently. I want to learn more Python and be more efficient, so I would like to code it in Pandas.
I tried df.groupby(['FROM', 'TO']).count(), but I lack the knowledge to use it for what I need. I keep either getting an error when I do something I should not, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min')).count(), but it seems I may have an incorrect data type, and I don't know whether this approach is applicable.
If I understand you correctly, one approach could be as follows:
Data
import pandas as pd
# IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, so we need `to_offset`
from pandas.tseries.frequencies import to_offset
# adjusting the last 2 timestamps to get a different interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)
print(df)
             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van
# e.g. for FROM we want: `A`, `4` (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want: `A-B`, 3 (COUNT), `00:15` (TIME-END)
# `A-C`, 1 (COUNT), `00:15` (TIME-END)
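The interval-end labelling described in the comments above can be checked in isolation. The following is a minimal sketch, assuming that "bucket" means the end of the 15-minute interval a timestamp falls into; the names `ts`, `interval_end` and `labels` are illustrative and not part of the original code.

```python
import pandas as pd

# two timestamps taken from the sample data above
ts = pd.to_datetime(pd.Series(['2022-11-03 00:00:06', '2022-11-03 00:20:21']))

# floor to the start of the 15-minute interval, then add 15 minutes to get its end
interval_end = ts.dt.floor('15min') + pd.Timedelta(minutes=15)

# keep only hours and minutes as the label
labels = interval_end.dt.strftime('%H:%M').tolist()
print(labels)  # ['00:15', '00:30']
```

So `00:00:06` lands in the bucket ending at `00:15`, and `00:20:21` in the one ending at `00:30`.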
Code
# convert time strings to datetime and set column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
# add a 15-minute offset (`'15min'`; the older alias `'15T'` is deprecated in recent pandas) to the datetime values
df.index = df.index + to_offset('15min')
# create `dict` for conversion of `col names`
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}
# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'],['from','to']]
for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from','to']`), resample into 15-minute bins,
    # and use `.agg` to create a named output (`COUNT`); we count `type` because it is
    # never one of the grouping keys. Finally, reset the index
    out = df.groupby(item).resample('15min').agg(COUNT=('type', 'count')).reset_index()
    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)
    # convert timestamps like `'2022-11-03 00:15:00'` to `00:15:00`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
    # rearrange order of columns; for second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
    # write output to a csv file; the f-string customizes the file name,
    # i.e. 'output_from.csv', 'output_from_to.csv'; `index=False` avoids writing the index
    out.to_csv(f'output_{"_".join(item)}.csv', index=False)
Output (loaded in excel)
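The question also mentions counts "for each vehicle type", which the loop above does not split out. Here is a rough sketch of one way to add that, using a plain `groupby` on a precomputed interval-end column rather than `resample`. The names `by_type` and `TYPE` are my own assumptions, and unlike `resample` this version does not emit rows for intervals with no vehicles.

```python
import pandas as pd

# same adjusted sample data as in the answer above
df = pd.DataFrame({
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                  '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                  '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'])

# label each row with the end of its 15-minute interval (assumed meaning of TIME-END)
df['TIME-END'] = (
    df['timestamp'].dt.floor('15min') + pd.Timedelta(minutes=15)
).dt.strftime('%H:%M')

# count rows per FROM, vehicle type and interval; only observed combinations appear
by_type = (
    df.groupby(['from', 'type', 'TIME-END'])
      .size()
      .reset_index(name='COUNT')
      .rename(columns={'from': 'FROM', 'type': 'TYPE'})
)
print(by_type)
```

Adding `'to'` to the list of grouping keys would give the FROM-TO variant, and `by_type.to_csv(..., index=False)` writes the result out in the same way as in the loop.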
Relevant documentation: