Sorting a list to maximize similarity between neighboring items

I have a list of keys:

['A', 'B', 'C']

For each of those keys there's a list of properties:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

I wish to sort the list of labels such that neighboring labels share as many of the properties as possible.

In the above example A and B share the relation 2, so they should be next to each other - whereas C shares nothing with them, so it can go anywhere.

So the possible orders for this example would be as follows:

["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable

Buckets

Actually I would prefer this to be represented by putting them into "buckets":

[["A", "B"], ["C"]] # this can represent all four possible orders above.

However, this gets problematic if a label belongs to two different buckets:

{
    'A': [2,3],
    'B': [1,2],
    'C': [1,4]
}

How would I represent that?

I could put it like this:

[["A", "B"], ["C", "B"]]

But then I need another processing step to turn the list of buckets into the final representation:

["A", "B", "C"]

And above that there could be recursively nested buckets:

[[["A","B"], ["C"]], ["D"]]

And then these could overlap:

[[["A","B"], ["C"]], ["A","D"]]

Quality

The "closeness", i.e. quality of a solution is defined as the sum of the intersection of relations between neighbors (the higher the quality the better):

def measurequality(result,mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality

# Example determining that the solution ['A', 'B', 'C'] has quality 1:
#measurequality(['A', 'B', 'C'],
#    {
#        'A': [2,3],
#        'B': [1,2],
#        'C': [4]
#    })

Brute-Forcing

Brute-forcing does not constitute a solution (in practice the list contains on the order of several thousand elements - though, if anyone got a brute-forcing approach that is better than O(n²)...).

However, using brute-forcing to create additional test cases is possible:

produce a list L of n items ['A','B','C',...]
produce for each item a dictionary R of relations (up to n random numbers between 0 and n should be sufficient).
produce all possible permutations of L and feed them together with R into measurequality() and keep those with maximal return value (might not be unique).

Code for creating random testcases to test the implementation:

import string
import random

def randomtestcase(n):
    keys=list(string.ascii_uppercase[0:n])

    minq = 0
    maxq = 0
    while minq == maxq:
        items={}
        for key in keys:
            items[key] = random.sample(range(1,10),int(random.random()*10))

        minq = n*n
        minl = list(keys)
        maxq = 0
        maxl = list(keys)
        for _ in range(0, 1000): # TODO: explicitly construct all possible permutations of keys.
            random.shuffle(keys)
            q = measurequality(keys,items)
            if q < minq:
                minq = q
                minl = list(keys)
            if maxq < q:
                maxq = q
                maxl = list(keys)

    return ( items, minl, maxq )

( items, keys, quality ) = randomtestcase(5)

sortedkeys = dosomething( keys, items )
actualquality = measurequality( sortedkeys, items )
if actualquality < quality:
    print('Suboptimal: quality {0} < {1}'.format(actualquality,quality))

Attempt

One of the many "solutions" that didn't work (very broken, this one doesn't have the selection of initial element / choice between prepending and appending to the result list that I had in others):

def dosomething(keys,items):
    result = []
    todo = list(keys)
    result.append(todo.pop())
    while any(todo):
        lastItems = set(items[result[-1]])
        bestScore = None
        bestKey = None
        for key in todo:
            score = set(items[key]).intersection(lastItems)
            if bestScore is None or bestScore < score:
                bestScore = score
                bestKey = key
        todo.remove(bestKey)
        result.append(bestKey)
    return result

Examples

(Also check out the example generator in the section Brute-Forcing above.)

Testing code trying some examples:

def test(description,acceptable,keys,arguments):
    actual = dosomething(keys,arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys),description))
    print("Expected: {0}\nBut was: {1}".format(acceptable,actual))
    print("Quality of result: {0}".format(measurequality(actual,arguments)))
    print("Quality of expected: {0}".format([measurequality(a,arguments) for a in acceptable]))
    return 1

print("EXAMPLES")
failures = 0

# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentially already a valid ordering.
for keys in [
        ["A","B","C"],
        ["A","C","B"],
        ["B","A","C"],
        ["B","C","A"],
        ["C","A","B"],
        ["C","B","A"]
    ]:
    failures += test(
        "1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
        ["ABC", "BAC", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [4]
        })

    failures += test(
        "2. They all have 2, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [2]
        })

    failures += test(
        "3. A and B share 2, B and C share 1, so B must be in the middle.",
        ["ABC", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1]
        })

    failures += test(
        "4. Each shares something with each other, creating a cycle, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1,3]
        })

if 0 < failures:
    print("{0} FAILURES".format(failures))

Precedence

As it was asked: the numbers used for the relations aren't in an order of precedence. An order of precedence exists, but it's a partial order and not the one of the numbers. I just didn't mention it because it makes the problem harder.

So given this example:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

Might be replaced by the following (using letters instead of numbers and adding precedence information):

{
    'A': [('Q',7),('R',5)],
    'B': [('P',6),('Q',6)],
    'C': [('S',5)]
}

Note that

The precedence is meaningful only within a list, not across lists.
The precedence of shared relations might be different between two lists.
Within a list there might be the same precedence several times.

Solution

This is a Travelling Salesman Problem, a notoriously hard problem to solve optimally. The code presented solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.

The idea of precedence, mentioned by the OP, is not explored.

The presented code consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_sets, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set, and for scalable test data for which optimal solutions are known.

If the performance is not good enough, at least it is a bit of a first effort and has the beginnings of a test 'workshop' to improve the heuristic.

Test Results

>>> python sortbylinks.py

===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1

===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6

Short, relatively trivial cases appear to be no problem.

===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

The "Quality <1000, cycling subsequence" (sequence length 501) is interesting. By grouping nodes with {0, 1} relation sets the quality score can be nearly doubled. The heuristic finds this optimal sequence. Quality 1000 is not quite possible because these double-linked groups need attaching to each other via a single-linked node every so often (e.g. {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}).

Performance is still good for this test data with few relations per node.

"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

One of the difficulties with Travelling Salesman Problems (TSPs) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even from a near-optimal or optimal start.

===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000

When there are very small numbers of relations, even if there are many nodes, performance is pretty good and the results may be close to optimal.

===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?

===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?

This is more like the data generated by the OP, and also more like the classical Travelling Salesman Problem (TSP) where you have a set of distances between each city pair (for 'city' read 'node') and nodes are typically richly connected to each other. In this case the links between nodes is partial-- there is no guarantee of a link between any 2 nodes.

The heuristic's time performance is much worse in cases like this. There are between 0 and n random relations for each node, for n nodes. This is likely to mean many more swap combinations yield improved quality, swaps and quality checks are more expensive in themselves and many more passes will be needed before the heuristic converges on its best result. This may mean O(n^3) in the worst case.

Performance degrades as the number of nodes and relations increases (note the difference between n=200-- 3 minutes-- and n=500-- 70 minutes.) So currently the heuristic may not be practical for several thousand richly-interconnected nodes.

In addition, the quality of the result for this test can't be known precisely because a brute-force solution is not computationally feasible. 6861 / 200 = 34.3 and 41999 / 500 = 84.0 average connections between node pairs doesn't look too far off.

Code for the Heuristic and Brute Force Solvers

import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)


def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end-start).seconds))
        return f_return_value
    return wrap


def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)


class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(self.presented_sequence())] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: not relation in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq

    def quality(self, sequence):
        """
        calculate quality of full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by power of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # seed candidate may be better than its derived sequences -- include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from beginning of sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from end of sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0

    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from present quality
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of pre-swapped sequence
        :return: swapped sequence, quality of swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence-- sequence on which to perform the element swap
        :param node_pos1-- zero-based position of first element
        :param pos2--: zero-based position of second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (which are dict keys and hence tuples)
            # needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links

    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
                self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]

Code Glossary

node_set: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}. The set of relations for each node can contain either no relations or an arbitrary number.

node: a key-value pair consisting of a node_id (key) and set of relations (value)

relation: as used by the OP, this is a number. If two nodes both share relation 1 and they are neighbours in the sequence, it adds 1 to the quality of the sequence.

sequence: an ordered set of node ids (e.g. ['A', 'B', 'C'] that is associated with a quality score. The quality score is the sum of shared relations between nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.

candidate: a sequence that is currently being investigated to see if it is of high quality.

Method

generate seed sequences by stable sorting on the presence or absence of each relation in a linked item
The initially presented order is also one of the seed sequences in case it is close to optimal
For each seed sequence, pairs of nodes are swapped looking for a higher quality score
Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until there is a distance of 1 (swapping immediate neighbours.) Keep only those sequences whose quality is more than the current top quality score
If a new highest quality score was found in this round, weed out all but top-score candidates and repeat 4 using top scorers as seeds. Otherwise exit.

Tests and Test Data Generators

The heuristic has been tested using small node_sets, scaled data of a few hundred up to 10,000 nodes with very simple relationships, and a randomised, richly interconnected node_set more like the OP's test data generator. Perfect single-linked sequences, linked cycles (small subsequences that link within themselves, and to each other) and shuffling have been useful to pick up and fix weaknesses.

ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)


def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)


def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model-- otherwise the model could be exhausted
    # or start mid-cycle
    # https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}


def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}


cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)


tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]
for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)

Performance

The heuristic is more 'practical' than the brute force solution because it leaves out many possible combinations. It may be that a lower-quality sequence produced by an element swap is actually one step away from a much higher-quality score after one more swap, but such a case might be weeded out before it could be tested.

The heuristic appears to find optimal results for single-linked sequences or cyclical sequences linked head to tail. These have a known optimal solution and the heuristic finds that solution and they may be less complex and problematic than real data.

A big improvement came with the introduction of an "incremental" quality calculation which can quickly calculate the quality difference a two-element swap makes without recomputing the quality score for the entire sequence.