python pandas filtering grouping intervals

Python: Identify breaking points in a data frame column

I am interested in identifying when different trips take place in a dataset. There are two lock states: locked means the vehicle is stationary, and unlocked means the vehicle is being used.

Because the same vehicle can be used by the same user multiple times, I first isolate one vehicle and one user by their IDs; the chronologically sorted datetime column then shows when the vehicle was used. To tell apart the different trips taken in the same vehicle by the same user, I thought of using my lock_state variable.

I've been trying to find how this could be done and percolation is something I came across but it seems too complex to understand and implement. I was wondering if there is an easier way of achieving this.

My end goal is to identify the number of trips (should be 2 in this example), add them to a new df alongside the user id and start/end datetimes (let's pretend all of this is the random column) and give them unique IDs. So the final output should be something like this (random made-up example):

trip_id      start_time  end_time  user_id
jk3b4334kjh  x           x         093723
nbnmvn829nk  x           x         234380

Assuming the following sample data is in chronological order, how could I identify the different trips from the state column? (Two trips should be identified, as there are two continuous runs of "unlocked" rows, separated by "locked" rows.)

import random
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
# should be 2 trips

random_values = random.sample(range(2,20), 8) 

df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })

df

>>
    state   random
0   locked      5
1   unlocked    12
2   unlocked    17
3   unlocked    13
4   locked      18
5   locked      6
6   unlocked    4
7   unlocked    9

Solution

I came up with this implementation of a 1D Hoshen-Kopelman cluster labelling.

import random
import pandas as pd
import numpy as np

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]

random_values = random.sample(range(2,20), 8) 

df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })
    

def hoshen_kopelman_1d(grid, occupied_label):
    """
    Hoshen Kopelman implementation for 1D graphs.
    
    Parameters:
            grid (pd.DataFrame): The 1D grid, with a "state" column and a default 0..n-1 integer index.
            occupied_label (str): The label that identifies occupied nodes.

    Returns:
            labeled_grid (pd.DataFrame): grid with cluster labeled nodes.
    """
    
    # create labeled_grid from the grid argument (not the global df) and assign all nodes to cluster 0
    labeled_grid = grid.assign(cluster=0)
    cluster_count = 0
    
    # iterate through the grid's nodes left to right
    for index, node in grid.iterrows():
        # check if node is occupied
        if node["state"] == occupied_label: # node is occupied
            if index == 0:
                # initialize new cluster
                cluster_count += 1
                labeled_grid.loc[0, "cluster"] = cluster_count
            else:
                # check if left-neighbour node is occupied
                if labeled_grid.loc[index-1, "cluster"] != 0: # left-neighbour node is occupied
                    # assign node to the same cluster as left-neighbour node
                    labeled_grid.loc[index, "cluster"] = labeled_grid.loc[index-1, "cluster"]
                else: # left-neighbour node is unoccupied
                    # initialize new cluster
                    cluster_count += 1
                    labeled_grid.loc[index, "cluster"] = cluster_count
                    
    return labeled_grid
                

M = hoshen_kopelman_1d(grid=df, occupied_label="unlocked")

It returns a new pandas.DataFrame with an extra "cluster" column, which indicates the cluster to which the node belongs (0 means the node is unoccupied and does not belong to any cluster).
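For larger frames, the same cluster column can probably be obtained without a Python-level loop. The following is a minimal sketch of that idea, not part of the original answer: it marks each row where an "unlocked" run begins and takes a cumulative sum of those marks. The names occupied, run_start and labeled are only illustrative.

```python
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"]
df = pd.DataFrame({"state": lock_state})

# True where the node is occupied (the vehicle is unlocked)
occupied = df["state"].eq("unlocked")

# A run starts where a row is occupied and the previous row is not;
# fill_value=False treats the row before the first one as unoccupied
run_start = occupied & ~occupied.shift(fill_value=False)

# Counting run starts cumulatively numbers the runs 1, 2, ...;
# unoccupied rows are reset to 0, as in the loop-based version
labeled = df.assign(cluster=run_start.cumsum().where(occupied, 0))
```

On the sample data this gives the cluster values 0, 1, 1, 1, 0, 0, 2, 2, so it should be interchangeable with the cluster column described above.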

Having this, it becomes pretty straightforward to retrieve the rows of a given trip, e.g. trip 1. We could do:

trip_1 = M.loc[M['cluster'] == 1]
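To get from the cluster labels to the trip table sketched in the question, one option is to group the occupied rows by cluster and aggregate. This is only a sketch under assumptions: the sample data has no datetime or user columns, so the timestamp and user_id columns below are hypothetical stand-ins, and uuid4 is just one possible way to generate unique trip IDs.

```python
import uuid
import pandas as pd

# Hypothetical labeled frame: `timestamp` and `user_id` are assumed columns
# that do not exist in the original sample data
labeled = pd.DataFrame({
    "user_id": ["093723"] * 8,
    "timestamp": pd.date_range("2021-01-01 08:00", periods=8, freq="min"),
    "cluster": [0, 1, 1, 1, 0, 0, 2, 2],
})

# Drop unoccupied rows (cluster 0), then take the first and last
# timestamp of each cluster per user
trips = (
    labeled[labeled["cluster"] != 0]
    .groupby(["user_id", "cluster"], as_index=False)
    .agg(start_time=("timestamp", "min"), end_time=("timestamp", "max"))
)

# Give every trip a random unique identifier and order the columns
# as in the desired output
trips["trip_id"] = [uuid.uuid4().hex for _ in range(len(trips))]
trips = trips[["trip_id", "start_time", "end_time", "user_id"]]
```

With several vehicles in the data, the vehicle ID would presumably need to be part of the groupby keys as well, and the clustering would have to be run per vehicle and user.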