Search code examples
pythonpython-3.xpandasdataframepandas-groupby

How to lump repeated values into the same interval?


I have a dataset similar to the below that is spatial. Has an "ID", "Assay" that comes from an analytical machine, From interval and To interval. I want to have it look by ID and Assay as it goes down from top to bottom, and find the repeated Assay values and lump them if they are repeating (right after the other). I tried using groupby and aggregate but ended up with it lumping any Assay value that was similar together, I only want it to put together if its back to back. Hopefully the example below makes sense. Thanks in advance! The result is what I want, but the code won't get me that.

import pandas as pd
df = pd.DataFrame({
    "ID":    [ 1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5],
    "Assay": [ 3, 3, 4, 3, 3, 6, 4, 4, 1, 1, 2, 2],
    "From":  [ 7, 8, 9,10, 0, 8,12,15, 0, 5,10,15],
    "To":    [13,14,15,16,17,18,13,100,5,10,15,25]
})
result = df.groupby(["ID", "Assay"]).agg({"From":['first'], "To":['last']})

Expected output:

          From   To
         first last
ID Assay
1  3         7   14
   4         9   15
   3        10   16
2  3         0   17
   6         8   18
3  4        12  100
5  1         0    5
   1         5   10
   2        10   15
   2        15   25

Solution

  • We could use diff+ne+cumsum to create groups out of consecutive Assays; then filter the Assays that are greater than 3 and use groupby.agg for aggregation. Then concatenate this result with the rows that were filtered for the final output:

    df['groups'] = df['Assay'].diff().ne(0).cumsum()
    msk = df['Assay'].ge(3)
    tmp = (df[msk].groupby(['ID','Assay', 'groups'], sort=False)
           .agg({'From':'first', 'To':'last'}).reset_index())
    out = pd.concat((tmp, df[~msk])).sort_values('groups').drop(columns='groups').reset_index(drop=True)
    

    Output:

       ID  Assay  From   To
    0   1      3     7   14
    1   1      4     9   15
    2   1      3    10   16
    3   2      3     0   17
    4   2      6     8   18
    5   3      4    12  100
    6   5      1     0    5
    7   5      1     5   10
    8   5      2    10   15
    9   5      2    15   25