Tags: algorithm, sorting, mathematical-optimization, combinatorics

Algorithm to sort one list simultaneously by two comparison functions, minimizing discordant pairs?


Suppose I have a list of tuples:

[(a1, b1), (a2, b2), ..., (an, bn)]

I could sort them by the a's, or the b's, but not both.

But what if I want to sort them by both as well as possible? A good way to measure how well they're sorted is the number of pairs of "a" values that are in the wrong order, plus the number of pairs of "b" values that are in the wrong order. What algorithm will do this quickly?
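To make the loss concrete, here is a minimal sketch of how it could be computed by brute force. The names `discordant_pairs` and `loss` are made up for illustration, and it assumes that equal values never count as being in the wrong order (only a strictly larger value before a smaller one does).

```python
from itertools import combinations

def discordant_pairs(values):
    """Count positions i < j whose values are strictly out of order."""
    # combinations() yields (values[i], values[j]) with i < j.
    return sum(1 for x, y in combinations(values, 2) if x > y)

def loss(pairs):
    """Discordant pairs among the a's plus discordant pairs among the b's."""
    a_values = [a for a, _ in pairs]
    b_values = [b for _, b in pairs]
    return discordant_pairs(a_values) + discordant_pairs(b_values)

example = [(3, 3), (2, 4), (1, 1)]
print(loss(example))  # 3 discordant a-pairs + 2 discordant b-pairs = 5
```

This is O(n²) and only meant for checking small inputs, not as the algorithm being asked for.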

An algorithm that minimizes a different loss function would also be interesting but I think what would be best for my application is to minimize discordant pairs.


Solution

  • Update: it turns out there is a very simple solution in O(n log n) time.

    Just sort the list by the a components, using the b components as a tiebreaker. (Or vice versa.) Or, if they are numbers, you can sort by the sum of the two components, a + b. This can be done in O(n log n) time using any efficient comparison-based sorting algorithm.
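As a rough illustration, the three orderings mentioned above could be written in Python as follows. The sample data is made up, and the sum-based key assumes the components are numbers.

```python
pairs = [(3, 1), (1, 2), (2, 2), (1, 1)]

# Sort by a, using b as a tiebreaker (tuples compare lexicographically).
by_a_then_b = sorted(pairs)

# Sort by b, using a as a tiebreaker.
by_b_then_a = sorted(pairs, key=lambda p: (p[1], p[0]))

# Sort by the sum a + b (numeric components only).
by_sum = sorted(pairs, key=lambda p: p[0] + p[1])

print(by_a_then_b)
print(by_b_then_a)
print(by_sum)
```

The three results generally differ from each other, but each one has the same total number of discordant pairs (2 for this sample).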

    This solution works because the loss function can be written as a sum of individual loss functions, for each pair of elements. For pairs like (2, 4) vs. (3, 3) which will be discordant whatever their relative order, the individual loss for that pair is always 1. Similarly, when two pairs are equal, such as (4, 5) vs. (4, 5), the individual loss for that pair is 0 whatever their relative order.

    The only non-constant individual loss functions are for pairs where one component is bigger and the other is bigger-or-equal, e.g. (2, 4) vs. (3, 4), or (2, 4) vs. (3, 5). Each of the sorting orders described above puts all such pairs in their optimal relative order. This simultaneously minimises every term in the loss function, and therefore minimises the total loss.
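The argument can be sanity-checked on a small input by comparing the sorted order against every permutation. This is only a sketch: `loss` and `forced_discordant` are illustrative names, and ties are assumed not to count as out of order.

```python
from itertools import combinations, permutations

def loss(pairs):
    """Sum of per-pair losses: 1 for each component that is out of order."""
    return sum((p[0] > q[0]) + (p[1] > q[1])
               for p, q in combinations(pairs, 2))

def forced_discordant(pairs):
    """Count pairs that are discordant whichever way round they are placed."""
    return sum((p[0] < q[0] and p[1] > q[1]) or (p[0] > q[0] and p[1] < q[1])
               for p, q in combinations(pairs, 2))

pairs = [(2, 4), (3, 3), (3, 4), (2, 4), (1, 5)]

# Exhaustive search over all 5! = 120 orderings.
best = min(loss(order) for order in permutations(pairs))

assert loss(sorted(pairs)) == best
assert loss(sorted(pairs, key=sum)) == best
# The optimum equals the number of unavoidable discordant pairs.
assert best == forced_discordant(pairs)
```

The exhaustive search is factorial in the list length, so it is useful only as a check on tiny examples.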

    Note that this works only for a list of 2-tuples. For 3-tuples or longer, a solution this simple won't work, but the ideas in my original answer (see below) can be adapted. Adapting them won't be easy, however, since the graph will not necessarily be acyclic.


    Original answer (expanded)

    This can be modelled as a kind of graph problem. Each pair (a_i, b_i) is a node in the graph.

    Insert a directed edge i → j whenever both a_i <= a_j and b_i <= b_j, unless the two pairs are identical (a_i = a_j and b_i = b_j). For any pairs where a_i < a_j and b_i > b_j, or vice versa, and any pairs where a_i = a_j and b_i = b_j, there is no edge. An edge expresses a preference for placing node i before node j; if there is no edge, then the loss is the same whatever the relative ordering of those two nodes.

    For the case of 2-tuples, it is quite straightforward to show that this graph is acyclic, from the way it is constructed: for numeric components, an edge i → j implies a_i + b_i < a_j + b_j, so the sums strictly increase along any path and no path can return to its starting node. So a topological sorting algorithm will find an ordering such that all edges point "forwards", i.e. node i appears before node j whenever there is an edge i → j. This ordering clearly minimises the loss function, because it simultaneously minimises the individual losses of every pair i, j.
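One way this could look in Python is sketched below, building the graph explicitly and handing it to the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The helper names `has_edge` and `topo_sorted` are made up for illustration, and the pairs are assumed to be tuples so that `!=` compares both components.

```python
from graphlib import TopologicalSorter

def has_edge(p, q):
    """Edge p -> q: p is <= q in both components and the pairs differ."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q

def topo_sorted(pairs):
    n = len(pairs)
    # TopologicalSorter expects a mapping from node to its predecessors.
    # Nodes are list indices, so duplicate pairs stay distinct nodes.
    predecessors = {
        j: {i for i in range(n) if has_edge(pairs[i], pairs[j])}
        for j in range(n)
    }
    order = TopologicalSorter(predecessors).static_order()
    return [pairs[i] for i in order]

pairs = [(3, 4), (2, 4), (3, 3), (1, 5), (2, 4)]
result = topo_sorted(pairs)
print(result)
```

Building the predecessor sets takes O(n²) comparisons, so this is mainly useful for seeing the graph formulation at work. The exact order returned may vary among the valid topological orders.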

    The only discordant pairs in the resulting order are those which are necessarily discordant: those where, whichever way round the pair ends up, either the a components or the b components are out of order.

    Actually implementing a topological sorting algorithm doesn't require constructing the graph explicitly; you can just treat the "nodes" and "edges" as an implicit graph, using comparisons to find the edges, instead of looking them up in some kind of graph data structure. To avoid scanning the whole list to find a node's neighbours on every iteration, you can take advantage of the fact that the edge relation is transitive: if node A only has edges to nodes B, C and D, then node B can only have edges to C and D. This will take O(n²) time in the worst case, but should be more efficient than brute force.
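As a sketch of the implicit-graph idea, the version below never stores edges and finds them by comparison only. Note that it is a simpler variant than the neighbour-tracking scheme described above: it repeatedly extracts a node with no incoming edge from the remaining nodes, relying on transitivity so that a single scan is enough to find one. The names `has_edge` and `implicit_topo_sort` are made up for illustration.

```python
def has_edge(p, q):
    """Edge p -> q: p is <= q in both components and the pairs differ."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q

def implicit_topo_sort(pairs):
    remaining = list(pairs)
    result = []
    while remaining:
        # Scan for a node with no incoming edge from the remaining nodes.
        # Whenever some node points at the current candidate, it becomes
        # the new candidate; transitivity guarantees that nodes rejected
        # earlier cannot point at the final candidate.
        best = 0
        for k in range(1, len(remaining)):
            if has_edge(remaining[k], remaining[best]):
                best = k
        result.append(remaining.pop(best))
    return result

pairs = [(3, 4), (2, 4), (3, 3), (1, 5), (2, 4)]
print(implicit_topo_sort(pairs))
```

Each extraction is a single linear scan, so the whole thing performs O(n²) comparisons.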