I have a large pandas DataFrame with a timestamp column, a value column, a "key" column, and a flag column marking the last row of a block (plus a few more columns):
timestamp, value, key, last_in_block
12345, 1.0, 2, False <-- start of block 1
12346, 0.5, 4, False
12347, 1.2, 1, False
12348, 2.2, 6, False
12349, 1.5, 3, False
12350, 1.2, 3, False
12351, 2.3, 3, False
12352, 0.4, 5, True
12371, 1.3, 2, False <-- start of block 2
12372, 0.9, 4, False
12373, 1.7, 1, False
12374, 2.0, 6, False
12375, 1.2, 3, False
12376, 1.4, 3, False
12377, 2.7, 3, False
12378, 0.8, 5, True
...
In the "key" column, a characteristic sequence of values (in this example 2 4 1 6 3 3 3 5) identifies a data block. I would like to apply a groupby to break the frame into these blocks. Is this possible, and if so, how?
import pandas as pd

df = pd.DataFrame(
    {"timestamp": list(range(12345, 12353)) + list(range(12371, 12379)),
     "value": [1.0, 0.5, 1.2, 2.2, 1.5, 1.2, 2.3, 0.4,
               1.3, 0.9, 1.7, 2.0, 1.2, 1.4, 2.7, 0.8],
     "key": [2, 4, 1, 6, 3, 3, 3, 5] * 2,
     "last_in_block": [False] * 7 + [True] + [False] * 7 + [True]})
You can use last_in_block to identify the groups, then use groupby:
import numpy as np

# Mark the first row of each block: a row starts a block when the
# previous row ended one (the NaN produced by shift() on row 0 is
# truthy, so the very first row is marked as well).
df["last_in_block"] = np.where(df["last_in_block"].shift(), 1, 0)
# Running total of the markers turns them into block numbers 1, 2, ...
df["last_in_block"] = df["last_in_block"].cumsum()
grouped = df.groupby("last_in_block")
grouped.groups shows the row indices belonging to each block:
{1: [0, 1, 2, 3, 4, 5, 6, 7], 2: [8, 9, 10, 11, 12, 13, 14, 15]}
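From there each block can be pulled out on its own with get_group, or you can iterate over the blocks directly. A minimal sketch, rebuilding the frame and labels from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"timestamp": list(range(12345, 12353)) + list(range(12371, 12379)),
     "value": [1.0, 0.5, 1.2, 2.2, 1.5, 1.2, 2.3, 0.4,
               1.3, 0.9, 1.7, 2.0, 1.2, 1.4, 2.7, 0.8],
     "key": [2, 4, 1, 6, 3, 3, 3, 5] * 2,
     "last_in_block": [False] * 7 + [True] + [False] * 7 + [True]})

# Label each row with its block number, as above.
df["last_in_block"] = np.where(df["last_in_block"].shift(), 1, 0)
df["last_in_block"] = df["last_in_block"].cumsum()
grouped = df.groupby("last_in_block")

# Fetch one block as its own DataFrame ...
block1 = grouped.get_group(1)

# ... or walk over all blocks in order.
for block_id, block in grouped:
    print(block_id, len(block))
```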
grouped.mean() then aggregates each block:
timestamp value key
last_in_block
1 12348.5 1.2875 3.375
2 12374.5 1.5000 3.375
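If you would rather not overwrite the last_in_block column, the same labelling can be built as a separate Series (using shift's fill_value parameter in place of the np.where step) and passed to groupby directly. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {"timestamp": list(range(12345, 12353)) + list(range(12371, 12379)),
     "value": [1.0, 0.5, 1.2, 2.2, 1.5, 1.2, 2.3, 0.4,
               1.3, 0.9, 1.7, 2.0, 1.2, 1.4, 2.7, 0.8],
     "key": [2, 4, 1, 6, 3, 3, 3, 5] * 2,
     "last_in_block": [False] * 7 + [True] + [False] * 7 + [True]})

# A row opens a new block when the previous row closed one;
# fill_value=True makes the very first row open block 1.
block_id = df["last_in_block"].shift(fill_value=True).cumsum()

# Grouping by the external Series leaves df itself untouched.
means = df.groupby(block_id).mean()
```

Grouping by a Series of the same length is equivalent to grouping by a column, so grouped.groups and grouped.mean() come out the same as above.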