Search code examples
pythonpython-3.xrecursion

Finding a continuous route through a list of lists


I am trying to find a continuous (with strictly increasing values) path through a list of lists. I have tried various recursive and reversed approaches, but have failed for hours.

The problem stems from interval-based pattern mining. Here, each event has exact one start and end time. The problem looks like this:

time = [[11, 38, 40], [12, 39, 49], [41], [4, 23, 43], [47], [17, 35, 60]]
Events = [[start time events A], [end time events A], [start time events B], [start time events C], [end time events B], [end time events C]]

The goal is to find all possible routes through data where

time[i] < time[i+1] holds.

A concrete example:

route_1 = [[11], [12], [41], [43], [47], [60]]
route_2 = [[38], [39], [41], [43], [47], [60]]

What is not valid: route_3 = [[11], [39], [41], [43], [47], [60]] because [11] describes the start time of Event A[0], but [39] represents the end time of Event A[1].

Can you name a suitable approach to solving this problem?

For my last (non-recursive) approach, I used a dictionary as a data representation. The approach produced the "closest" result to the expected result

time_reversed = {'end_C': [17, 35, 60], 'end_B': [47], 'start_C': [ 4, 23, 43], 'start_B': [41], 'end_A': [12, 39, 49], 'start_A': [11, 38, 40]}

import numpy as np    
def consecutive(time):
    time_pruned = time.copy()
    keys = list(time.keys())

    for i in range(len(keys)-1):
        lower = time_pruned[keys[i+1]]
        max_lower = np.nanmax(lower)
        max_upper = np.nanmax(time_pruned[keys[i]])

        while max_lower > max_upper:
            lower = np.delete(lower, np.nanargmax(lower))
    
            max_lower = np.nanmax(lower)
            time_pruned[keys[i+1]] = lower

    return time_pruned

However, this approach is not valid, since it does not consider the event-affiliation and do not now at the moment, how to consider everything in an efficient way.

The function above yields:

consecutive(time_reversed)
>>>{'end_C': [17, 35, 60],
 'end_B': [47],
 'start_C': [4, 23, 43],
 'start_B': [41],
 'end_A': array([12, 39]),
 'start_A': array([11, 38])}

Update 1: I've tried to describe the approach more detailed. I also tried to bring it into code but failed while deleting elements of the list while iterating over it.

Start position for an element wise comparison is

i = 0
j = 0
l = data_pruned[keys[i]][j] = 11
r = data_pruned[keys[i+1]][j] = 12

Whereby check() is defined as l < r

Iteration 1:

11 -> 12 -> 41 -> 4 (check() = False) remove 4 from start_C and 17 from end_C since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38, 40],
                'end_A': [12, 39, 49],
                'start_B': [41],
                'start_C': [23, 43],
                'end_B': [47],
                'end_C': [35, 60]}

Iteration 2:

11 -> 12 -> 41 -> 23 (check() = False) remove 23 from start_C and 35 from end_C since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38, 40],
                'end_A': [12, 39, 49],
                'start_B': [41],
                'start_C': [43],
                'end_B': [47],
                'end_C': [60]}

Iteration 3:

11 -> 12 -> 41 -> 43 -> 47 -> 60 --> Possible route since each value is strictly increasing

-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Iteration 4:

38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing

-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Iteration 5:

38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing

40 -> 49 -> 41 (check() = False) remove 49 from end_a and 40 from start_A since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38],
                'end_A': [12, 39],
                'start_B': [41],
                'start_C': [43],
                'end_B': [47],
                'end_C': [60]}

Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Done since all possibilities checked

Update 2: I export the data from database into different data representations such as pandas.

import pandas as pd
from io import StringIO

data = """\
event,occur,start,end
A,1,11,12
A,2,38,39
A,3,40,49
B,1,41,47
C,1,4,17
C,2,23,35
C,3,43,60
"""

# Read the data into a DataFrame
df = pd.read_csv(StringIO(data))

Solution

  • If I understand correctly, you have groups (A, B, C in the example), and each group has intervals (start/end). You want to use exactly one interval from each group, and have a sorted list of the involved times.

    A constraint is that the order of times should follow a given pattern. A pattern could for example be A-start, A-end, B-start, C-start, B-end, C-end. Outputs that do not follow this sequence should be excluded.

    Proposed algorithm

    As preprocessing step, you could organise the input per group, and for each determine the two indices of the route where they will write the start and end index of an interval. These indices are determined by the pattern input. For instance, for the example in the question, we would have:

    A: [0, 1]  # spans are written to these indices of the output route
    B: [2, 4]
    C: [3, 5]
    

    Then determine for each output index, which other two indices would then already be filled in and would determine the "window" for the value that can be written to the targeted index. For the above example, the route would be populated in this order (determined by the pattern):

    For A, we fill in the start at index 0
    For A, we fill in the end   at index 1
    For B, we fill in the start at index 2
    For B, we fill in the end   at index 4  # note we skip index 3!
    For C, we fill in the start at index 3
    For C, we fill in the end   at index 5
    

    Now when we are at the step where we fill in a end time for B, we should make sure it is not less than the value stored at index 2. This is always the same index, so we want to calculate this index up front (in its relation to index 4).

    Similarly, when we are at the next step, where we fill in a start time for C at index 5, we should make sure it is not more than the value already stored at index 4. So again we want to prepare for this and set up this limit relationship between index 4 and 5.

    We can use a stack-based algorithm for this (linear time complexity) and so determine for each index where we have to look to find the other two indices that will already have values and which set the "window" for the current value.

    Once that preprocessing is done, we can start the actual DFS search for solutions.

    There are more little aspects in this algorithm, but they should be clear from the comments in the code below:

    Implementation

    def routes(data, pattern):
        # setlimits: Helper function to determine which indices in the route
        #    determine the range that the value at index i of a route can have
        #    given the order in which the route is populated (pattern).
        def setlimits(pattern):
            stack = [-1]
            limit = []
            for i, key in enumerate(pattern):
                while len(stack) > 1 and pattern[stack[-1]] >= key:
                    stack.pop()
                limit.append(stack[-1] + 1)
                stack.append(i)
            return limit
                
        n = len(pattern) // 2
        # Get sorted list of event keys, relying on sorted dict (Python 3.6+)
        keymapper = { key : 0 for key in pattern }
        # Map each key (string) to a unique sequence number (0..n-1)
        keymapper = { key : i for i, key in enumerate(keymapper) }
        # Convert the pattern to use these sequence numbers instead of key strings
        pattern = [keymapper[key] for key in pattern]
        # Construct data structure that gives for each group the two indices 
        #    in a route where the start/end of a span have to be written
        schedule = [[-1, 0, []] for _ in range(n)]
        # Populate the list of start/end spans for each group...
        #     NB: the data is assumed to be sorted by seq column per group
        for key, seq, start, end in data:  
            schedule[keymapper[key]][2].append((start, end))
        # Populate the two indices for where these spans should be output in a route
        for i, key in enumerate(pattern):
            schedule[key][schedule[key][0] > -1] = i
        
        inf = float("inf")
        # route is a buffer to populate during DFS algorithm
        route = [-inf] * (n*2+2)  # an extra entry before and after the route
        route[-1] = inf
    
        # Prepare dependencies that imply constraints
        low = setlimits(pattern)
        high = [len(route) - 1 - i for i in setlimits(pattern[::-1])[::-1]]
    
        # Recursive algorithm to assign pairs of start/end times to routes
        def dfs(event):
            if event >= n: # A route has been completed: 
                yield route[1:-1]  # exclude the dummy entries from the route
                return
            i, j, spans = schedule[event]
            # Determine the "windows" for values at i, and at j in the route
            startlow = route[low[i]]
            starthigh = route[high[i]]
            endlow = route[low[j]]
            endhigh = route[high[j]]
            # Try each of the spans...
            for start, end in spans:
                if startlow <= start <= starthigh and endlow <= end <= endhigh:
                    # This span is acceptable. Write it to the route and recur...
                    route[i+1] = start
                    route[j+1] = end
                    yield from dfs(event + 1)
        
        return list(dfs(0))
    

    Here is how you would run it for the example in the question:

    data = [
        ["A", 1, 11, 12],
        ["A", 2, 38, 39],
        ["A", 3, 40, 49],
        ["B", 1, 41, 47],
        ["C", 1, 4, 17],
        ["C", 2, 23, 35],
        ["C", 3, 43, 60],
    ]
    pattern = ["A","A","B","C","B","C"]
    
    for route in routes(data, pattern):
        print(route)
    

    This outputs:

    [11, 12, 41, 43, 47, 60]
    [38, 39, 41, 43, 47, 60]