Search code examples
javacomparator

thenComparing seems to not be lexicographic, despite the docs saying so


According to the docs,

thenComparing

default <U extends Comparable<? super U>> Comparator thenComparing(Function<? super T,? extends U> keyExtractor)

Returns a lexicographic-order comparator with a function that extracts a Comparable sort key.

Thus, the output of the following bit of code should be "a : 1", "a : 12", "a : 4". However, the resulting list is sorted by ascending numbers.

How come? Am I misunderstanding the docs?

(The s-attribute is just there to have an initial sorting key, so that thenComparing can be applied)

import java.util.*;

public class MyClass {
    public static void main(String[] args) {
      List<Container> list = new ArrayList<>();
      list.add(new Container("a", 4));
      list.add(new Container("a", 12));
      list.add(new Container("a", 1));
  
      list.sort(Comparator.comparing(Container::getS)
          .thenComparing(Container::getX)
      );
      
      for (Container c : list) {
          System.out.println(c.getS() + " : " + c.getX());
      }
    }
    
    public static class Container {
        private String s;
        private Integer x;
        
        public Container(String s, Integer x) {
            this.s = s;
            this.x = x;
        }
        
        public Integer getX() {
            return this.x;
        }
        
        public String getS() {
            return this.s;
        }
    }
}

Solution

  • Updated answer

    "Lexicographic order" literally means ordering just like in a dictionary (from the Greek word lexicon which means dictionary). Dictionary order works as follows:

    1. Compare the first letter of each word. If they're different, stop here.

    2. With the first letter of both words being the same, now compare the second letter of each word. If they're different, stop.

    3. Repeat step 2 with the third, and following letters until you find a difference.

    So a word is broken up into elements (letters) and each element is compared in turn until a difference is found. This is the definition of lexicographic order. Exactly the same happens with comparing()..thenComparing(). How? Like this:

    Comparator.comparing(Container::getS) <-- this is a comparator

    .thenComparing(Container::getX) <-- this takes the previous comparator and returns a new comparator.

    The key here is that the comparator returned by thenComparing doesn't just compare x. In fact thenComparing returns a comparator that compares s first, and then x. That's the definition of lexicographic order. But note that s and x are each one an indivisible element. You compare s using natural order, and then you compare x using natural order (which means comparing as an integer).

    Lexicographic order does not mean splitting up a single element (x) into sub-elements. This would be analogous to an English dictionary trying to compare words by splitting individual letters into the individual strokes of the pen that make up that letter.


    Addendum by the OP of the question:

    The above can be stated succinctly as: the comparator resulting from a .thenComparing induces a lexicographical order on a certain cartesian product. (see the relevant section on Wikipedia for lex orders on cartesian products)

    More precisely, say a class C has two comparable attributes (i.e. whose class implementing Comparabale) public S s and public T t. Then Comparator.comparing(C::getS).thenComparing(C::getT) returns a comparator which gives an order on "the cartesian product of S and T". And this order is lexicographical, i.e. "first sort S into buckets using the compareTo of S, then sort the T-objects in every S-bucket using the compareTo of T".


    Old answer (kept just because it's referenced in a comment)

    Lexicographic order is not just alphabetical order. It can also be applied to numbers, in which case it is numerical order. As stated in the Online Encyclopedia of Integer Sequences:

    When applied to numbers, lexicographic order is increasing numerical order

    Thus, it is entirely correct that thenComparing orders 4 before 12, since you're comparing Integer instances.

    This is not solely related to thenComparing: the same result would be obtained by simply calling comparing. Note that calling comparing followed by thenComparing does not magically put two fields into a String. It compares the first field, and then, if and only if, the first fields of the two instances are equal, it commpares the second field (referred to by thenComparing) using that field's (and only that field's) natural order.