Search code examples
pythonpython-itertools

Obtaining all sequences of tuples satisfying position criteria by combining subsets of tuples in a list


I have a list of tuples with each tuple carrying the following information:

(start_position,end_position,list of strings)

A sample list is given as follows:

aList = [(6, 9, ['ataH']), 
(4, 9, ['svataH']), 
(0, 9, ['vEvasvataH']), 
(2, 5, ['vasu', 'vasU']), 
(1, 3, ['Eva', 'eva']), 
(0, 1, ['vA', 'vE'])]

I need to find all sequences of tuples such that each sequence has to cover all positions from the start_position to end_position, in this case from 0 to 9. In a sequence, say a, adjacent tuples need to satisfy the constraints that a[i+1][0] - a[i][1] <= 1.

Summarily, the output should be as follows:

[[(0, 1, ['vA', 'vE']), (2,5,['vasu', 'vasU']), (6, 9, ['ataH'])  ],
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
 [(0, 9, ['vEvasvataH'], [7])]]

I have used the following code to achieve the same.

maxVal = max(aList,key=lambda item:item[1])[1]
diffSets = list()

for i,item in enumerate(aList):
    if maxVal == item[1]:  
        _temp = [item]        
        currStart = item[0]
        for j,stuff in enumerate(aList):
            if i != j:
                if currStart == stuff[1] or currStart == stuff[1]+1:                    
                    _temp.append(stuff)
                    currStart = stuff[0]        
        diffSets.append(_temp) #the output needs to be reversed to get the sequence in the order as shown above

Is there a more efficient and faster way to achieve the same, say using itertools?


Solution

  • Here's how I would go about it (not to say that it's necessarily faster). First, you can start by sorting the data based on both start and end. This will mean when we look at combinations later we won't have to backtrack to other values in our result (we will know that the start of entry[i] must be less than or equal to the start of entry[i+1])

    import itertools
    import operator
    
    data = [
        (6, 9, ['ataH']),
        (4, 9, ['svataH']),
        (0, 9, ['vEvasvataH']),
        (2, 5, ['vasu', 'vasU']),
        (1, 3, ['Eva', 'eva']),
        (0, 1, ['vA', 'vE'])
    ]
    
    data = sorted(data, key=operator.itemgetter(0, 1))
    start = min(data, key=operator.itemgetter(0))[0]
    end = max(data, key=operator.itemgetter(1))[1]
    

    Now we have the data sorted, and know our start and end values.

    To find all of the combinations of the data for any size of subset, we can use this trick: https://stackoverflow.com/a/5898031/3280538

    def all_combinations(data):
        for i in range(len(data)):
            yield from itertools.combinations(data, r=i+1)
    

    Here we use yield to avoid creating expensive containers. Now we write another method that uses these combinations, and for each one checks it's validity:

    def valid_combinations(data):
        for comb in all_combinations(data):
            if comb[0][0] != start or comb[-1][1] != end:
                continue
        
            for i in range(len(comb) - 1):
                if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                    break
            else:
                yield comb
    

    Here I'm using a neat trick with the for-loop. The else block on the loop will only execute if the for loop completes naturally and does not break, and if we don't break then we know each item is valid.

    All together:

    import itertools
    import operator
    
    
    def all_combinations(data):
        for i in range(len(data)):
            yield from itertools.combinations(data, r=i+1)
    
    
    def valid_combinations(data):
        data = sorted(data, key=operator.itemgetter(0, 1))
        start = min(data, key=operator.itemgetter(0))[0]
        end = max(data, key=operator.itemgetter(1))[1]
    
        for comb in all_combinations(data):
            if comb[0][0] != start or comb[-1][1] != end:
                continue
    
            for i in range(len(comb) - 1):
                if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                    break
            else:
                yield comb
    

    Getting the results:

    from pprint import pprint
    pprint(list(valid_combinations(
        [
            (6, 9, ['ataH']),
            (4, 9, ['svataH']),
            (0, 9, ['vEvasvataH']),
            (2, 5, ['vasu', 'vasU']),
            (1, 3, ['Eva', 'eva']),
            (0, 1, ['vA', 'vE'])
        ]
    )))
    
    [((0, 9, ['vEvasvataH']),),
     ((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])),
     ((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH']))]