I am trying to find a continuous (with strictly increasing values) path through a list of lists. I have tried various recursive and reversed approaches, but have failed for hours.
The problem stems from interval-based pattern mining. Here, each event has exact one start and end time. The problem looks like this:
time = [[11, 38, 40], [12, 39, 49], [41], [4, 23, 43], [47], [17, 35, 60]]
Events = [[start time events A], [end time events A], [start time events B], [start time events C], [end time events B], [end time events C]]
The goal is to find all possible routes through data where
time[i] < time[i+1]
holds.
A concrete example:
route_1 = [[11], [12], [41], [43], [47], [60]]
route_2 = [[38], [39], [41], [43], [47], [60]]
What is not valid:
route_3 = [[11], [39], [41], [43], [47], [60]]
because [11]
describes the start time of Event A[0]
, but [39]
represents the end time of Event A[1]
.
Can you name a suitable approach to solving this problem?
For my last (non-recursive) approach, I used a dictionary as a data representation. The approach produced the "closest" result to the expected result
time_reversed = {'end_C': [17, 35, 60], 'end_B': [47], 'start_C': [ 4, 23, 43], 'start_B': [41], 'end_A': [12, 39, 49], 'start_A': [11, 38, 40]}
import numpy as np
def consecutive(time):
time_pruned = time.copy()
keys = list(time.keys())
for i in range(len(keys)-1):
lower = time_pruned[keys[i+1]]
max_lower = np.nanmax(lower)
max_upper = np.nanmax(time_pruned[keys[i]])
while max_lower > max_upper:
lower = np.delete(lower, np.nanargmax(lower))
max_lower = np.nanmax(lower)
time_pruned[keys[i+1]] = lower
return time_pruned
However, this approach is not valid, since it does not consider the event-affiliation and do not now at the moment, how to consider everything in an efficient way.
The function above yields:
consecutive(time_reversed)
>>>{'end_C': [17, 35, 60],
'end_B': [47],
'start_C': [4, 23, 43],
'start_B': [41],
'end_A': array([12, 39]),
'start_A': array([11, 38])}
Update 1: I've tried to describe the approach more detailed. I also tried to bring it into code but failed while deleting elements of the list while iterating over it.
Start position for an element wise comparison is
i = 0
j = 0
l = data_pruned[keys[i]][j] = 11
r = data_pruned[keys[i+1]][j] = 12
Whereby check()
is defined as l < r
Iteration 1:
11 -> 12 -> 41 -> 4 (check() = False) remove 4 from start_C and 17 from end_C since each event consists of exact one start and end time
data_pruned = { 'start_A': [11, 38, 40],
'end_A': [12, 39, 49],
'start_B': [41],
'start_C': [23, 43],
'end_B': [47],
'end_C': [35, 60]}
Iteration 2:
11 -> 12 -> 41 -> 23 (check() = False) remove 23 from start_C and 35 from end_C since each event consists of exact one start and end time
data_pruned = { 'start_A': [11, 38, 40],
'end_A': [12, 39, 49],
'start_B': [41],
'start_C': [43],
'end_B': [47],
'end_C': [60]}
Iteration 3:
11 -> 12 -> 41 -> 43 -> 47 -> 60 --> Possible route since each value is strictly increasing
-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned
?
Iteration 4:
38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing
-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned
?
Iteration 5:
38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing
40 -> 49 -> 41 (check() = False) remove 49 from end_a and 40 from start_A since each event consists of exact one start and end time
data_pruned = { 'start_A': [11, 38],
'end_A': [12, 39],
'start_B': [41],
'start_C': [43],
'end_B': [47],
'end_C': [60]}
Do we have other possible strictly increasing combinations/routes/paths in data_pruned
?
Done since all possibilities checked
Update 2: I export the data from database into different data representations such as pandas.
import pandas as pd
from io import StringIO
data = """\
event,occur,start,end
A,1,11,12
A,2,38,39
A,3,40,49
B,1,41,47
C,1,4,17
C,2,23,35
C,3,43,60
"""
# Read the data into a DataFrame
df = pd.read_csv(StringIO(data))
If I understand correctly, you have groups (A, B, C in the example), and each group has intervals (start/end). You want to use exactly one interval from each group, and have a sorted list of the involved times.
A constraint is that the order of times should follow a given pattern. A pattern could for example be A-start, A-end, B-start, C-start, B-end, C-end. Outputs that do not follow this sequence should be excluded.
As preprocessing step, you could organise the input per group, and for each determine the two indices of the route where they will write the start and end index of an interval. These indices are determined by the pattern input. For instance, for the example in the question, we would have:
A: [0, 1] # spans are written to these indices of the output route
B: [2, 4]
C: [3, 5]
Then determine for each output index, which other two indices would then already be filled in and would determine the "window" for the value that can be written to the targeted index. For the above example, the route would be populated in this order (determined by the pattern):
For A, we fill in the start at index 0
For A, we fill in the end at index 1
For B, we fill in the start at index 2
For B, we fill in the end at index 4 # note we skip index 3!
For C, we fill in the start at index 3
For C, we fill in the end at index 5
Now when we are at the step where we fill in a end time for B, we should make sure it is not less than the value stored at index 2. This is always the same index, so we want to calculate this index up front (in its relation to index 4).
Similarly, when we are at the next step, where we fill in a start time for C at index 5, we should make sure it is not more than the value already stored at index 4. So again we want to prepare for this and set up this limit relationship between index 4 and 5.
We can use a stack-based algorithm for this (linear time complexity) and so determine for each index where we have to look to find the other two indices that will already have values and which set the "window" for the current value.
Once that preprocessing is done, we can start the actual DFS search for solutions.
There are more little aspects in this algorithm, but they should be clear from the comments in the code below:
def routes(data, pattern):
# setlimits: Helper function to determine which indices in the route
# determine the range that the value at index i of a route can have
# given the order in which the route is populated (pattern).
def setlimits(pattern):
stack = [-1]
limit = []
for i, key in enumerate(pattern):
while len(stack) > 1 and pattern[stack[-1]] >= key:
stack.pop()
limit.append(stack[-1] + 1)
stack.append(i)
return limit
n = len(pattern) // 2
# Get sorted list of event keys, relying on sorted dict (Python 3.6+)
keymapper = { key : 0 for key in pattern }
# Map each key (string) to a unique sequence number (0..n-1)
keymapper = { key : i for i, key in enumerate(keymapper) }
# Convert the pattern to use these sequence numbers instead of key strings
pattern = [keymapper[key] for key in pattern]
# Construct data structure that gives for each group the two indices
# in a route where the start/end of a span have to be written
schedule = [[-1, 0, []] for _ in range(n)]
# Populate the list of start/end spans for each group...
# NB: the data is assumed to be sorted by seq column per group
for key, seq, start, end in data:
schedule[keymapper[key]][2].append((start, end))
# Populate the two indices for where these spans should be output in a route
for i, key in enumerate(pattern):
schedule[key][schedule[key][0] > -1] = i
inf = float("inf")
# route is a buffer to populate during DFS algorithm
route = [-inf] * (n*2+2) # an extra entry before and after the route
route[-1] = inf
# Prepare dependencies that imply constraints
low = setlimits(pattern)
high = [len(route) - 1 - i for i in setlimits(pattern[::-1])[::-1]]
# Recursive algorithm to assign pairs of start/end times to routes
def dfs(event):
if event >= n: # A route has been completed:
yield route[1:-1] # exclude the dummy entries from the route
return
i, j, spans = schedule[event]
# Determine the "windows" for values at i, and at j in the route
startlow = route[low[i]]
starthigh = route[high[i]]
endlow = route[low[j]]
endhigh = route[high[j]]
# Try each of the spans...
for start, end in spans:
if startlow <= start <= starthigh and endlow <= end <= endhigh:
# This span is acceptable. Write it to the route and recur...
route[i+1] = start
route[j+1] = end
yield from dfs(event + 1)
return list(dfs(0))
Here is how you would run it for the example in the question:
data = [
["A", 1, 11, 12],
["A", 2, 38, 39],
["A", 3, 40, 49],
["B", 1, 41, 47],
["C", 1, 4, 17],
["C", 2, 23, 35],
["C", 3, 43, 60],
]
pattern = ["A","A","B","C","B","C"]
for route in routes(data, pattern):
print(route)
This outputs:
[11, 12, 41, 43, 47, 60]
[38, 39, 41, 43, 47, 60]