Search code examples
pythonpandasdataframeaggregate

Check if timestamps column values are in data range for each group


I would like to achieve the following with python pandas.

Suppose I have the following tables:

Events table

id event capture_date 
1  Up   '2021-01-10'  
1  Down '2021-03-25'
       ...  
1  Up   '2021-08-01'  
2  Up   '2021-06-02'  
2  Down '2021-09-15'  
2  Up   '2021-11-03' 
2  Down '2021-11-05'
       ...
10  Down'2020-06-02'  
10  Up  '2020-09-15'  
10  Down'2021-10-03' 
10  Up  '2021-11-09'

Activity table

id activity_date 
1  '2021-01-12'  
1  '2021-03-28'  
1  '2021-08-04'  
2  '2021-06-30'  
2  '2021-11-01'  
2  '2021-11-03'
10 '2020-01-03'  

For each id group, I would like to add a third column to the "Activity" table" that depends on the "capture_date" from the "Events" table. For example, for id =1: if the the activity_date falls between Up and Down events, then write 0, otherwise 1. And this should be done for every id group.

Thanks!


Solution

  • This solution uses a package called staircase which is built on pandas and numpy for step functions. The data you have describes step functions. The functions step "Up" and "Down" as you have labelled it. We'll assume this means stepping up to a value of 1 and down to a value of 0, and then switch these values in line with your spec.

    setup

    import pandas as pd
    
    events = pd.DataFrame(
        {
            "id":[1,1,1,2,2,2],
            "event":["Up", "Down", "Up", "Up", "Down", "Up"],
            "capture_date":["2021-01-10", "2021-03-25", "2021-08-01", "2021-06-02", "2021-09-15", "2021-11-03"],
        }
    )
    
    activities = pd.DataFrame(
        {
            "id":[1,1,1,2,2,2],
            "activity_date":["2021-01-12", "2021-03-28", "2021-08-04", "2021-06-30", "2021-11-01", "2021-11-03"],
        }
    )
    
    events["capture_date"] = pd.to_datetime(events["capture_date"])
    activities["activity_date"] = pd.to_datetime(activities["activity_date"])
    

    solution

    A stepfunction is represented by the staircase.Stairs class. It takes vectors of start and end times (up and down in your case). These times do not need to be paired, or equal in count. If there is no matching "down" after an "up" then this simply means the step function will continue on with a value of 1 indefinitely as it heads towards infinity. . Likewise, if there is no matching "up" then the step function will have a value of 1 indefinitely as it heads towards negative infinity. We'll group the events dataframe by id and use the data to construct a step function for each id

    import staircase as sc
    
    stepfunctions = events.groupby("id").apply(lambda df: 
        sc.Stairs(
            start=df.query("event == 'Up'").capture_date,
            end=df.query("event == 'Down'").capture_date,
        )
    )
    

    stepfunctions is a pandas.Series, indexed by id and the values are staircase.Stairs objects

    id
    1    <staircase.Stairs, id=2209708361352>
    2    <staircase.Stairs, id=2209708432264>
    dtype: object
    

    To handle the case where there are id values in events which do not appear in activities we'll manually add zero-valued step functions for these ids

    for i in pd.Index(activities["id"]).difference(stepfunctions.index):
        stepfunctions[i] = sc.Stairs()
    

    You can do many things with step functions, including visualising

    stepfunctions[1].plot(style="hlines")
    

    enter image description here

    We can sample the step functions eg

    stepfunctions[1].sample("2021-01-12", include_index=True)
    

    which gives you

    2021-01-12    1
    dtype: int64
    

    These step functions are currently 1 between up and down, and 0 between down and up. It seems like you want the opposite. This can be done by flipping these boolean valued step functions with staircase.Stairs.invert

    Putting it all together, with the help of pandas.concat we have

    result = (
        pd.concat(
            [stepfunctions[id].invert().sample(df["activity_date"], include_index=True) for id, df in activities.groupby("id")]
        )  # a pandas.Series
        .astype(int)  # convert integer floats
        .rename("Activity")  # renames the Series to your required name
        .reset_index() # turns activity date from index to column
        .assign(id=activities["id"])  # inserts id column
    )
    

    result will be a dataframe

      activity_date  Activity  id
    0    2021-01-12         0   1
    1    2021-03-28         1   1
    2    2021-08-04         0   1
    3    2021-06-30         0   2
    4    2021-11-01         1   2
    5    2021-11-03         0   2
    

    note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.