Tags: python, pandas, filtering, grouping, intervals

Python: Identify breaking points in a data frame column


I want to identify when different trips take place in a dataset. There are two lock states: "locked" means the vehicle is stationary, and "unlocked" means the vehicle is being used.

Because the same vehicle can be used by the same user multiple times, I first isolate one vehicle and one user through their IDs. From a chronologically sorted datetime column I can then see when the vehicle was used. To tell apart the different trips taken in the same vehicle by the same user, I thought of using my lock_state variable.

I've been trying to find out how this could be done. Percolation is something I came across, but it seems too complex to understand and implement. I was wondering if there is an easier way of achieving this.

My end goal is to identify the number of trips (should be 2 in this example), add them to a new df alongside the user id and start/end datetimes (let's pretend all of this is the random column) and give them unique IDs. So the final output should be something like this (random made-up example):

trip_id      start_time  end_time  user_id
jk3b4334kjh  x           x         093723
nbnmvn829nk  x           x         234380

Assuming the following sample data is in chronological order, how could I identify the different trips from the state column? (Two trips should be identified, since the column contains two continuous runs of "unlocked" separated by "locked" rows.)

import random
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
# should be 2 trips

random_values = random.sample(range(2,20), 8)

df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })

df

>>
    state   random
0   locked      5
1   unlocked    12
2   unlocked    17
3   unlocked    13
4   locked      18
5   locked      6
6   unlocked    4
7   unlocked    9


Solution

  • I came up with this implementation of 1D Hoshen-Kopelman cluster labelling.

    import random
    import pandas as pd
    import numpy as np
    
    lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
    
    random_values = random.sample(range(2,20), 8) 
    
    df = pd.DataFrame(
        {'state': lock_state,
         'random': random_values
        })
        
    
    def hoshen_kopelman_1d(grid, occupied_label):
        """
        Hoshen-Kopelman implementation for 1D grids.
        
        Parameters:
                grid (pd.DataFrame): The 1D grid, with a "state" column and a default RangeIndex (0, 1, 2, ...).
                occupied_label (str): the label that identifies occupied nodes.
    
        Returns:
                labeled_grid (pd.DataFrame): grid with cluster labeled nodes.
        """
        
        # create labeled_grid and assign all nodes to cluster 0
        # (use the grid argument here, not the global df)
        labeled_grid = grid.assign(cluster=0)
        cluster_count = 0
        
        # iterate through the grid's nodes left to right
        for index, node in grid.iterrows():
            # check if node is occupied
            if node["state"] == occupied_label: # node is occupied
                if index == 0:
                    # initialize new cluster
                    cluster_count += 1
                    labeled_grid.loc[0, "cluster"] = cluster_count
                else:
                    # check if left-neighbour node is occupied
                    if labeled_grid.loc[index-1, "cluster"] != 0: # left-neighbour node is occupied
                        # assign node to the same cluster as left-neighbour node
                        labeled_grid.loc[index, "cluster"] = labeled_grid.loc[index-1, "cluster"]
                    else: # left-neighbour node is unoccupied
                        # initialize new cluster
                        cluster_count += 1
                        labeled_grid.loc[index, "cluster"] = cluster_count
                        
        return labeled_grid
                    
    
    M = hoshen_kopelman_1d(grid=df, occupied_label="unlocked")
    

    It returns a new pandas.DataFrame with an extra "cluster" column, which indicates the cluster to which the node belongs (0 means the node is unoccupied and does not belong to any cluster).

    Having this, it becomes pretty straightforward to retrieve the rows of, e.g., trip 1:

    trip_1 = M.loc[M['cluster'] == 1]
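
For longer frames, the same cluster labels can probably be obtained without the row-by-row loop, and the labelled rows can then be collapsed into the trip table asked for in the question. The following is only a sketch under some assumptions: it swaps the loop for a shift/cumsum comparison (not Hoshen-Kopelman itself), `label_runs` is a hypothetical helper name, the `timestamp` and `user_id` columns are made-up stand-ins for the real datetime and user columns, and `uuid.uuid4().hex` is just one possible way to generate unique trip IDs.

```python
import uuid
import pandas as pd


def label_runs(states, occupied_label):
    """Number consecutive runs of occupied_label as 1, 2, ...; 0 elsewhere."""
    occupied = states.eq(occupied_label)
    # A run starts where a row is occupied and the previous row is not
    starts = occupied & ~occupied.shift(fill_value=False)
    # Counting the starts so far gives the run number; unoccupied rows get 0
    return starts.cumsum().where(occupied, 0)


# Illustrative data: "timestamp" and "user_id" are assumed columns
df = pd.DataFrame({
    "state": ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"],
    "timestamp": pd.date_range("2021-01-01 08:00", periods=8, freq="min"),
    "user_id": ["093723"] * 8,
})
df["cluster"] = label_runs(df["state"], "unlocked")

# One row per trip: first and last timestamp of each cluster
trips = (
    df[df["cluster"] != 0]
    .groupby("cluster")
    .agg(start_time=("timestamp", "min"),
         end_time=("timestamp", "max"),
         user_id=("user_id", "first"))
    .reset_index(drop=True)
)
# Random hex strings as trip IDs (an assumption; any unique ID scheme works)
trips.insert(0, "trip_id", [uuid.uuid4().hex for _ in range(len(trips))])

print(trips)
```

With several vehicles or users in one frame, the labelling would need to be applied per vehicle/user group first, so that a run does not continue across a group boundary.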