I am interested in identifying when different trips take place in a dataset. There are two lock states: "locked" means the vehicle is stationary and "unlocked" means the vehicle is being used.
Since the same vehicle can be used by the same user multiple times, I first isolate a vehicle and a unique user through their IDs; from a chronologically sorted datetime column I can then see when the vehicle was used. To tell apart different trips taken in the same vehicle by the same user, I thought of using my lock_state variable.
I've been trying to find how this could be done and percolation is something I came across but it seems too complex to understand and implement. I was wondering if there is an easier way of achieving this.
My end goal is to identify the number of trips (should be 2 in this example), add them to a new df alongside the user id and start/end datetimes (let's pretend all of this is the random column) and give them unique IDs. So the final output should be something like this (random made-up example):
trip_id      start_time  end_time  user_id
jk3b4334kjh  x           x         093723
nbnmvn829nk  x           x         234380
Assuming the following sample data is in chronological order, how could I identify the different trips from the state variable? (Two trips should be identified, since there are two continuous runs of "unlocked" rows separated by "locked" rows.)
import random
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
# should be 2 trips
random_values = random.sample(range(2, 20), 8)
df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })
df
>>
      state  random
0    locked       5
1  unlocked      12
2  unlocked      17
3  unlocked      13
4    locked      18
5    locked       6
6  unlocked       4
7  unlocked       9
I came up with this implementation of 1D Hoshen-Kopelman cluster labelling:
import random
import pandas as pd
import numpy as np
lock_state = ["locked", "unlocked", "unlocked", "unlocked", "locked", "locked", "unlocked", "unlocked"]
random_values = random.sample(range(2,20), 8)
df = pd.DataFrame(
    {'state': lock_state,
     'random': random_values
    })
def hoshen_kopelman_1d(grid, occupied_label):
    """
    Hoshen-Kopelman implementation for 1D graphs.

    Parameters:
        grid (pd.DataFrame): The 1D grid, with a "state" column and a default 0..n-1 index.
        occupied_label (str): The label that identifies occupied nodes.

    Returns:
        labeled_grid (pd.DataFrame): Copy of grid with cluster-labeled nodes.
    """
    # create labeled_grid from the grid argument (not the global df)
    # and assign all nodes to cluster 0
    labeled_grid = grid.assign(cluster=0)
    cluster_count = 0

    # iterate through the grid's nodes left to right
    for index, node in grid.iterrows():
        # check if node is occupied
        if node["state"] == occupied_label:  # node is occupied
            if index == 0:
                # first node has no left neighbour: initialize new cluster
                cluster_count += 1
                labeled_grid.loc[0, "cluster"] = cluster_count
            else:
                # check if left-neighbour node is occupied
                if labeled_grid.loc[index - 1, "cluster"] != 0:  # left-neighbour node is occupied
                    # assign node to the same cluster as left-neighbour node
                    labeled_grid.loc[index, "cluster"] = labeled_grid.loc[index - 1, "cluster"]
                else:  # left-neighbour node is unoccupied
                    # initialize new cluster
                    cluster_count += 1
                    labeled_grid.loc[index, "cluster"] = cluster_count

    return labeled_grid

M = hoshen_kopelman_1d(grid=df, occupied_label="unlocked")
It returns a new pandas.DataFrame with an extra "cluster" column, which indicates the cluster to which each node belongs (0 means the node is unoccupied and does not belong to any cluster).
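If the loop becomes slow on a large frame, the same "cluster" labels can probably be obtained with vectorized pandas operations instead. The following is only a sketch of that alternative (it is a shift/cumsum trick, not Hoshen-Kopelman), and it assumes the rows are already in chronological order and belong to a single vehicle and user:

```python
import pandas as pd

lock_state = ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"]
df = pd.DataFrame({"state": lock_state})

# True where the vehicle is in use
occupied = df["state"].eq("unlocked")

# A run starts where a row is occupied and the previous row is not
# (the first row has no previous row, so it is treated as "not occupied")
starts = occupied & ~occupied.shift(fill_value=False)

# The running count of run starts numbers the runs 1, 2, ...;
# unoccupied rows are set back to 0
df["cluster"] = starts.cumsum().where(occupied, 0)

print(df["cluster"].tolist())
```

This avoids iterrows and does not depend on the index being 0..n-1, since it only looks at row order.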
With this, it becomes pretty straightforward to retrieve the rows of, e.g., trip 1. We could do
trip_1 = M.loc[M['cluster'] == 1]
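To get from the labeled frame to the summary table asked for in the question (one row per trip with a unique ID, start/end datetimes and the user ID), one option is to group by the "cluster" column. This is a minimal sketch: the "timestamp" and "user_id" columns and their values are made up for illustration, and using uuid4 for the trip IDs is just one possible choice.

```python
import uuid
import pandas as pd

# Hypothetical labeled frame; "timestamp" and "user_id" stand in for the real columns
M = pd.DataFrame({
    "state": ["locked", "unlocked", "unlocked", "unlocked",
              "locked", "locked", "unlocked", "unlocked"],
    "timestamp": pd.date_range("2021-01-01 08:00", periods=8, freq="min"),
    "user_id": ["093723"] * 8,
    "cluster": [0, 1, 1, 1, 0, 0, 2, 2],
})

trips = (
    M[M["cluster"] != 0]              # drop rows that belong to no trip
    .groupby("cluster")
    .agg(start_time=("timestamp", "first"),   # rows are chronological,
         end_time=("timestamp", "last"),      # so first/last are start/end
         user_id=("user_id", "first"))
    .reset_index(drop=True)
)

# Give every trip a random unique identifier
trips.insert(0, "trip_id", [uuid.uuid4().hex for _ in range(len(trips))])

print(len(trips))  # number of trips
print(trips)
```

The number of trips is then simply len(trips), or equivalently M["cluster"].max().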