I have the folowing table
TimeStamp | Name | Marks | Subject |
---|---|---|---|
2022-01-01 00:00:02.969 | Chris | 70 | DK |
2022-01-01 00:00:04.467 | Chris | 75 | DK |
2022-01-01 00:00:05.965 | Mark | 80 | DK |
2022-01-01 00:00:08.962 | Cuban | 60 | DK |
2022-01-01 00:00:10.461 | Cuban | 58 | DK |
I want to aggregate the table for each column into 20minute aggregate which includes max, min, values
Expected output
TimeStamp | Subject | Chris_Min | Chris_Max | Chris_STD | Mark_Min | Mark_Max | Mark_STD |
---|---|---|---|---|---|---|---|
2022-01-01 00:00:00.000 | DK | 70 | 75 | ||||
2022-01-01 00:20:00.000 | DK | etc | etc | ||||
2022-01-01 00:40:00.000 | DK | etc | etc |
I am having hard time aggregating the data into required output. The agggregation should be dynamic so as to change to 10min or 30min.
I tried using bins to do it, but not getting the desired results.
Please Help.
You could try the following:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
.unstack(0)
.swaplevel(0, 1).reset_index()
)
TimeStamp
as index, and grouping by Subject
and Name
to get the right chunks to work on..resampling()
the groups with the given frequency rule
..agg()
with named tuples.Name
) to get it in the columns.Result for the given sample:
TimeStamp Subject Min Max STD
Name Chris Cuban Mark Chris Cuban Mark Chris Cuban Mark
0 2022-01-01 DK 70 58 80 75 60 80 3.535534 1.414214 NaN
If you want the columns exactly like in your expected output then you could add the following
result = result[
list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
to get
TimeStamp Subject Chris_Min Chris_Max ... Cuban_STD Mark_Min Mark_Max Mark_STD
0 2022-01-01 DK 70 75 ... 1.414214 80 80 NaN
If you're getting the TypeError: aggregate() missing 1 required positional argument...
error (the comment is gone), then it could be that you're working with an older Pandas version that can't deal with named tuples. You could try the following instead:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg({"Marks": ["min", "max", "std"]})
.droplevel(0, axis=1)
.unstack(0)
.swaplevel(0, 1).reset_index()
)
...