Tags: python, pandas, group-by

Python Pandas grouping DataFrame by blocks of sequential values


I have a large pandas DataFrame with a timestamp column, a value column, a "key" column and a column with a flag indicating the end of a block (and some more columns, actually):

timestamp, value, key, last_in_block
12345, 1.0, 2, False <-- start of block 1
12346, 0.5, 4, False
12347, 1.2, 1, False
12348, 2.2, 6, False
12349, 1.5, 3, False
12350, 1.2, 3, False
12351, 2.3, 3, False
12352, 0.4, 5, True
12371, 1.3, 2, False <-- start of block 2
12372, 0.9, 4, False
12373, 1.7, 1, False
12374, 2.0, 6, False
12375, 1.2, 3, False
12376, 1.4, 3, False
12377, 2.7, 3, False
12378, 0.8, 5, True
...

In the "key" column a characteristic sequence of values (in this example 2 4 5 6 3 3 3 5) appears identifying a data block. I would like to apply a "groupBy" statement to break the Frame into groups by blocks. Is this / how is this possible?

import pandas as pd

df = pd.DataFrame(
    {"timestamp": list(range(12345, 12353)) + list(range(12371, 12379)),
     "value": [1.0, 0.5, 1.2, 2.2, 1.5, 1.2, 2.3, 0.4, 1.3, 0.9, 1.7, 2.0, 1.2, 1.4, 2.7, 0.8],
     "key": [2, 4, 1, 6, 3, 3, 3, 5] * 2,
     "last_in_block": [False] * 7 + [True] + [False] * 7 + [True]})

Update, answers to questions:

  • I know the sequence in advance
  • The sequences are complete, apart from very rare exceptions

Solution

  • You can use last_in_block to derive a group label, then group by it:

    df["last_in_block"] = np.where(df["last_in_block"].shift(), 1, 0)
    df["last_in_block"] = df["last_in_block"].cumsum()
    grouped = df.groupby("last_in_block")
    

    grouped.groups:

    {1: [0, 1, 2, 3, 4, 5, 6, 7], 2: [8, 9, 10, 11, 12, 13, 14, 15]}
    

    grouped.mean():

                   timestamp   value    key
    last_in_block                          
    1                12348.5  1.2875  3.375
    2                12374.5  1.5000  3.375
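
  • If you need per-block processing beyond aggregates such as mean(), the
    groups can also be iterated directly. The sketch below is an illustration,
    not part of the original answer: expected_keys restates the known key
    sequence from the question, and the comparison is one way to catch the
    rare incomplete blocks mentioned in the update.

    # Hypothetical per-block loop; `expected_keys` is assumed from the question.
    expected_keys = [2, 4, 1, 6, 3, 3, 3, 5]
    for block_id, block in grouped:
        if list(block["key"]) != expected_keys:
            # Rare incomplete block: handle or skip as appropriate.
            continue
        # Example per-block computation: spread of the values in the block.
        print(block_id, block["value"].max() - block["value"].min())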