Search code examples
pythonpandaspandas-groupbycumulative-sumbins

Is there a way to put running cumulative data into discrete time bins by group?


Beginning with something like this:

SUBJECT  TIME   C_INTAKE
C02      0.00    0.0
C02      48.00   0.2
C02      49.31   0.4 
C02      51.20   0.4
C02      52.61   0.4
C02      55.61   0.6 
C02      82.77   0.9
C02      84.97   1.1
C02      86.39   1.1
.
.
C03      0.00     0.0

I'd like to:

  • make set time bins (e.g. 10, 20, 30, 40, 50, 60, 70, 80, 90, 100); (in reality I have 92 900-s bins)
  • record cumulative intake for each of those time bins
  • and start again for each subject
    SUBJECT  BINTIME C_INTAKE2  
    C02      0.00    0.0
    C02      10.00   0.0
    C02      20.00   0.0 
    C02      30.00   0.0
    C02      40.00   0.0
    C02      50.00   0.4 
    C02      60.00   0.6
    C02      70.00   0.6
    C02      80.00   0.6
    C02      90.00   1.1
    C02     100.00   1.1
    .
    .
    C03      0.00     0.0

Thank you for your advice

EDIT: Thank you for your input so far. In trying to adapt it to a larger data set, I've come across an error. I am brand new to this (but learning from this forum!) so I'm afraid I'm not able to troubleshoot it on my own.

import pandas as pd
import numpy as np

# Import data from url
url = "https://raw.githubusercontent.com/yadatree/AL/main/intake.csv"
intake=pd.read_csv(url)

# select data
subject = intake['SUBJECT']
time = intake['NTIME']
c_intake = intake['cintake']

df = pd.DataFrame({'subject': subject, 'time': time, 'c_intake': c_intake})

# Create bins
bins = list(range(0,82801, 92))

# calculate cumulative sum by subject and bin
df.groupby([df'subject, pd.cut(df.time, bins)]) \
    .agg({'c_intake': 'sum'}) \
    .groupby('subject') \
    .cumsum() \
    .reset_index()

Ouput:

File "<ipython-input-32-ffea10b96606>", line 19
    df.groupby([df'subject, pd.cut(df.time, bins)])     .agg({'c_intake': 'sum'})     .groupby('subject')     .cumsum()     .reset_index()
                                                              ^
SyntaxError: invalid syntax


Solution

  • Assuming you want to sum c_intake within a subject and bin group, the following solution should work.

    import pandas as pd
    import numpy as np
    
    # Create sample data
    subjects = [item for subject in ['C02', 'C03', 'C04'] for item in [subject] * 10]
    time = np.random.randint(0,100, 30)
    c_intake = np.random.rand(30)
    
    df = pd.DataFrame({'subject': subjects, 'time': time, 'c_intake': c_intake})
    
    # Create desired bins
    bins = list(range(0,101, 10))
    
    # calculate cumulative sum by subject and bin
    df.groupby(['subject', pd.cut(df.time, bins)]) \
        .agg({'c_intake': 'sum'}) \
            .groupby('subject') \
            .cumsum() \
            .reset_index()