Tags: python, algorithm, sorting, sequence, circular-list

Create unique identifier for undirected circular sequences


Say I have an undirected circular sequence that looks like this:

  1 —— 2 —— 3
 /           \
1             1
|             |
3             2
 \           /
  3 —— 2 —— 3

Say I have 3 sequences as below, represented by lists of numbers:

seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right

Since the sequence is directionless, all 3 sequences are essentially identical and represent the circular sequence above. In reality, I have thousands of these undirected circular sequences, so comparing every pair of them is impractical. Therefore, I want to create a unique identifier for each distinct undirected circular sequence. For example, the identifier should be the same for the 3 sequences above.

My idea is to treat sequences of this type as circular graphs. Then I can assign each edge a weight equal to the difference between its two connected nodes, and find the path that traverses all nodes while maximizing the sum of the edge weights. Below is my Python implementation:

def identifier(seq):
    delta_sum = float('-inf')
    res_seq = []
    for i in range(len(seq)):
        new_seq = seq[i:] + seq[:i]
        ds = sum([new_seq[j+1] - new_seq[j] for j in range(len(seq)-1)])
        if ds > delta_sum:
            delta_sum = ds
            res_seq = new_seq
        if -ds > delta_sum:
            delta_sum = -ds
            res_seq = new_seq[::-1]
    return ','.join(map(str, res_seq))

print(identifier(seq1))
print(identifier(seq2))
print(identifier(seq3))

Output:

1,1,2,3,1,2,3,2,3,3
1,1,2,3,1,2,3,2,3,3
1,2,3,2,3,3,1,1,2,3

Clearly my algorithm isn't working. It creates the same identifier for the first two sequences, but a different one for the 3rd sequence. Can anyone suggest a relatively fast algorithm (preferably Python code) that can create a unique identifier for this kind of sequence?

Below are some related questions, but they do not cover exactly what I want to achieve:

How to check whether two lists are circularly identical in Python

Fast way to compare cyclical data


Solution
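
A side note on why the attempt in the question gives inconsistent results: the sum of consecutive differences telescopes to `new_seq[-1] - new_seq[0]`, so the score depends only on the two endpoints of each rotation. Several rotations can tie for the maximum, and the loop keeps whichever it meets first, which depends on where the input list happens to start. The sketch below checks this on `seq1`; `delta_sum` is a hypothetical helper name introduced here for illustration only.

```python
def delta_sum(seq):
    # Same quantity the question's loop computes for a single rotation.
    return sum(seq[j + 1] - seq[j] for j in range(len(seq) - 1))

seq1 = [1, 1, 3, 3, 2, 3, 2, 1, 3, 2]
rotations = [seq1[i:] + seq1[:i] for i in range(len(seq1))]

# Every intermediate term cancels, leaving only last minus first.
for r in rotations:
    assert delta_sum(r) == r[-1] - r[0]

# The question's code effectively maximizes |ds| (it tries both ds and -ds),
# so count how many rotations share the best score.
scores = [abs(delta_sum(r)) for r in rotations]
best = max(scores)
ties = [r for r, s in zip(rotations, scores) if s == best]
print(best, len(ties))
```

Because more than one rotation reaches the best score, the result is not a canonical form, which is why a total ordering over the full sequence (as below) is needed.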

  • You could use tuples as hashable identifiers and pick the smallest one among all rotations of the sequence, read in both directions:

    def identifier(s):
        return min((*s[i::d],*s[:i:d]) for d in (1,-1) for i in range(len(s)))
    

    Example:

    seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
    seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
    seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right
    
    print(identifier(seq1))
    print(identifier(seq2))
    print(identifier(seq3))

    Output:

    (1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
    (1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
    (1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
    

    Since the smallest tuple must start with the smallest value, you can optimize this a bit by first finding the minimum value and only comparing the tuples that start at an index holding that value:

    def identifier(seq):
        start  = min(seq)
        starts = [i for i,v in enumerate(seq) if v == start]
        return min((*seq[i::d],*seq[:i:d]) for d in (1,-1) for i in starts)
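
Since the identifier is a hashable tuple, one possible way to use it for the original goal (avoiding pairwise comparison of thousands of sequences) is to bucket the sequences in a dictionary keyed by identifier. This is only a sketch: `group_by_identifier` and `seq4` are hypothetical names and data added for illustration, and `identifier` is the optimized version from above, repeated so the block runs on its own.

```python
from collections import defaultdict

def identifier(seq):
    # Smallest tuple among the rotations (both directions) that start
    # at an occurrence of the minimum value.
    start = min(seq)
    starts = [i for i, v in enumerate(seq) if v == start]
    return min((*seq[i::d], *seq[:i:d]) for d in (1, -1) for i in starts)

def group_by_identifier(seqs):
    # Hypothetical helper: map each identifier to the sequences sharing it.
    groups = defaultdict(list)
    for s in seqs:
        groups[identifier(s)].append(s)
    return dict(groups)

seq1 = [1, 1, 3, 3, 2, 3, 2, 1, 3, 2]
seq2 = [3, 2, 3, 3, 1, 1, 2, 3, 1, 2]
seq3 = [3, 1, 2, 3, 2, 3, 3, 1, 1, 2]
seq4 = [1, 2, 3, 1, 2, 3, 2, 3, 1, 3]  # assumed example of a different ring

groups = group_by_identifier([seq1, seq2, seq3, seq4])
for key, members in groups.items():
    print(key, len(members))
```

Each sequence is processed once, so the number of sequences only affects the cost linearly. Computing a single identifier still compares up to two candidate tuples per occurrence of the minimum value, which can become quadratic in the sequence length when the minimum value is repeated often.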