Search code examples
pythonpandascalculated-columns

Create a column B that stores the sum of a group in a column A but only at a specific value


I have a dataframe (below) where I read in time series data of when lights were turned and left on and then turned off. I created a series of calculated columns in my dataframe to identify each cluster (continuous "On" clusters). What I could like to do is sum the total length of time of that cluster but only store that value at the earliest time (first "On") of each particular cluster. *** The desired outcome would look like column "Length"

I have all the necessary components to create the cluster but cannot figure out the code to sum the timesteps of that particular cluster and store the value at the earliest time.

ANY help would be greatly appreciated!

Example data table....The desired outcome is the LENGTH column

The data is also here (I WAS UNABLE REPLACE THE 0s IN COLUMNS "CLUSTER" and "LENGTH" with NaNs but They should be. The IMAGE is correct."

import pandas as pd
import numpy as np

data = {'Series':  ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A' ],
        'Time': [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5],
        'TimeStep': [0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5],
        'Light': ["Off", "Off", "On", "On", "On", "Off", "Off", "Off", "On", "On", "On", "On", "On", "Off", "Off"],
        'Cluster': [0, 0, 1, 1, 1, 0, 0, 0, 1,1, 1, 1, 1, 0, 0],
        'ClusterLength': [0, 0, 1.5, 0, 0, 0, 0, 0, 2.5, 0, 0, 0, 0, 0, 0],              }

df = pd.DataFrame (data, columns = ['Series','Time','TimeStep','Light','Cluster','ClusterLength'])
df

Solution

  • My solution computes both Cluster and ClusterLength columns, just as in your picture, i.e. with empty strings as "absent" values.

    In order not to overwrite your original columns, I named them Clust and ClustLen.

    Start from defining a function generating the cluster length for the current group of rows (grouped by the cluster number):

    def getClustLen(grp):
        rv = pd.Series('', index=grp.index)
        if type(grp.iloc[0].Clust) == int:
            rv.iloc[0] = grp.TimeStep.sum()
        return rv
    

    Then, to generate Clust column, run:

    df['Clust'] = (df.Light.ne(df.Light.shift()) & df.Light.eq('On')).cumsum()
    df.Clust = df.Clust.where(df.Light == 'On', '')
    

    And the final step is to generate ClustLen column:

    df['ClustLen'] = df.groupby('Clust').apply(getClustLen)\
        .reset_index(level=0, drop=True)
    

    The result is:

       Series  Time  TimeStep Light  Cluster  ClusterLength Clust ClustLen
    0       A   0.5       0.5   Off        0            0.0               
    1       A   1.0       0.5   Off        0            0.0               
    2       A   1.5       0.5    On        1            1.5     1      1.5
    3       A   2.0       0.5    On        1            0.0     1         
    4       A   2.5       0.5    On        1            0.0     1         
    5       A   3.0       0.5   Off        0            0.0               
    6       A   3.5       0.5   Off        0            0.0               
    7       A   4.0       0.5   Off        0            0.0               
    8       A   4.5       0.5    On        2            2.5     2      2.5
    9       A   5.0       0.5    On        2            0.0     2         
    10      A   5.5       0.5    On        2            0.0     2         
    11      A   6.0       0.5    On        2            0.0     2         
    12      A   6.5       0.5    On        2            0.0     2         
    13      A   7.0       0.5   Off        0            0.0               
    14      A   7.5       0.5   Off        0            0.0               
    

    In the final version, to save new values under original names, change Clust to Cluster and ClustLen to ClusterLength.