Search code examples
bashsortinglanguage-lawyerlexicographicstable-sort

Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?


Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in, so that 11<2:

$ echo -e '3 2\n3 11' | sort -n
3 11
3 2

In man sort, I read

       -s, --stable
              stabilize sort by disabling last-resort comparison

which implies that without -s a last-resort comparison is done (between ties, because -s does not affect non-ties).

So the question is: how is this last-resort comparison accomplished? A reference to the source code would be welcome, if necessary to answer the question.

This answer Unix deduces, from experimentation, that the sorting of ties is lexicographic.

Does the standard/POSIX say anything about this?


Solution

  • Question: How is the last resort comparison done?

    This is quickly answered in documentation of GNU coreutils:

    A pair of lines is compared as follows: sort compares each pair of fields (see --key), in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison.

    Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE

    source: Sort Invocation GNU Coreutils

    This means that the final resort will sort according to the sorting order of LC_COLLATE, i.e. lexicographically (mostly).

    POSIX, on the other hand adds a final ultra-last resort option which is stricter.

    If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

    source: Sort POSIX standard

    I am not certain if this is implemented in GNU sort, since it is not a requirement. Nonetheless, POSIX strongly recommends it (See Rationale last paragraph)

    What does this mean in case of the OP?

    There is an uncomfortable misunderstanding of the key-definitions. Assume you do something like

    $ sort --option -k1,3 file
    

    It is often understood that sort will first sort on field 1, then 2 and finally 3 using --option. This is incorrect. It will use the key to be defined as the substring consisting of fields 1 till 3. And in case when two lines collate equally, sort will perform the last-resort option (see earlier)

    aa  bb cc xxxxxxxx
    ---------           <<< rule1: according to the key
    ------------------  <<< rule2: lexicographical sort (last resort)
    

    Using GNU sort, you can see which substring is used for the sort. This is done with the --debug option. Here you see the difference between 3 simple cases:

    # Sort lexicographically with full line
    # -------------------------------------------------------------------
    $ echo -e "ab c d\nefg h i" | sort --debug
    sort: using ?en_GB.UTF-8? sorting rules
    ab c d
    ______
    efg h i
    _______
    # -------------------------------------------------------------------
    # Sort lexicographically with the substring formed by field 1 and 2
    # -------------------------------------------------------------------
    $ echo -e "ab c d\nefg h i" | sort -k1,2 --debug
    sort: using ?en_GB.UTF-8? sorting rules
    sort: leading blanks are significant in key 1; consider also specifying 'b'
    ab c d
    ____
    ______
    efg h i
    _____
    _______
    # -------------------------------------------------------------------
    # Sort lexicographically with field 1 followed by field 2
    # -------------------------------------------------------------------
    $ echo -e "ab c d\nefg h i" | sort -k1,1 -k2,2 --debug
    sort: using ?en_GB.UTF-8? sorting rules
    sort: leading blanks are significant in key 1; consider also specifying 'b'
    sort: leading blanks are significant in key 2; consider also specifying 'b'
    ab c d
    __
      __
    ______
    efg h i
    ___
       __
    _______
    

    When you do a numeric sort (using -n or -g), sort will attempt to extract a number from the key (1234abc leads to 1234) and use that number for the sorting.

    # Sort numerically with full line
    # -------------------------------------------------------------------
    $ echo -e "3a 11a\n3b 2b" | sort -n --debug
    sort: using ?en_GB.UTF-8? sorting rules
    3a 11a
    _         # numeric on full line
    ______    # lexicographically on full line  (last resort)
    3b 2b
    _         # numeric on full line
    _____     # lexicographically on full line  (last resort)
    # -------------------------------------------------------------------
    # Sort numerically with field 1 then field 2
    # -------------------------------------------------------------------
    $ echo -e "3a 11a\n3b 2b" | sort -n -k1,1 -k2,2 --debug
    sort: using ?en_GB.UTF-8? sorting rules
    3b 2b
    _         # numeric on field 1
       _      # numeric on field 2
    _____     # lexicographically on full line  (last resort)
    3a 11a
    _         # numeric on field 1
       __     # numeric on field 2
    ______    # lexicographically on full line  (last resort)
    

    As you notice in these two cases, even though the first field can be ordered lexicographically 3a < 3b, it is ignored as we only pick the number from the key.