
Sorting non-padded lines


Suppose we have an unsorted list of numbers that are not zero-padded:

seq 100 | shuf

Piping that into sort gives:

1
10
100
11
12
13
...

because sort compares lines lexicographically by default. Its -n option sorts numerically instead, which yields the expected result here. However, that does not help when the strings are not purely numeric:

seq 100 | shuf | sed 's/^/E/' | sort -n

Or for a more complex case:

paste -dS <(seq 100 | shuf | sed 's/^/E/' | sort -n) <(seq 100 | shuf)
---
E1S70
E10S75
E100S41
E11S53
...

whereas the expected output sorts characters lexicographically but numbers numerically:

E1S70
E10S75
E11S53
E100S41

Think of numbers as a single block, compared numerically with other numbers, but lexicographically with other characters.

What's an efficient way to sort non-zero-padded mixed strings?


Solution

  • You appear to be describing natural sort:

    natural sort order (or natural sorting) is the ordering of strings in alphabetical order, except that multi-digit numbers are treated atomically, i.e., as if they were a single character

    It is not clear how you wish to handle fractional numbers (which could be delimited by various characters, such as . or ,) or a mix of zero-padded and non-padded numbers.


    For shell programming, GNU has extended sort with a -V/--version-sort option, although this may not do what you want for a list such as:

    B27Y23S1
    E10S33
    ES020.4F3
    ES20.14F3
    ES2014F3
    YF29399G3G3G
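
    For the simpler lines in the question, though, -V appears to give the requested order. A minimal sketch, assuming GNU sort from coreutils (other sort implementations may not accept -V):

    ```shell
    # Four of the sample lines from the question, in scrambled order.
    # Assumes GNU sort: -V (--version-sort) is a GNU extension.
    sorted=$(printf '%s\n' E100S41 E11S53 E1S70 E10S75 | sort -V)
    printf '%s\n' "$sorted"
    ```

    With GNU coreutils this should print E1S70, E10S75, E11S53, E100S41, one per line.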
    

    Perl has Sort::Versions, Sort::Key::Natural, etc.

    (cf. Perl sort numbers naturally)

    Python has natsort, etc.

    (cf. Is there a built in function for string natural sort?)

    JavaScript has natural-sortby, etc.

    (cf. Natural sort of alphanumerical strings in JavaScript)

    I expect other environments have also come up with their own solutions.
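
    Where sort -V is not available, the natural sort rule quoted above can be approximated in plain shell by building a sort key in which every run of digits is left-padded with zeros, sorting on that key, and then discarding it. This is only a sketch under some assumptions: natural_sort is a hypothetical helper name, lines must not contain tab characters, digit runs are assumed to be at most 20 digits long, and zero-padded and non-padded numbers (020 vs 20) get the same key. Characters such as . or , are treated as ordinary characters, so fractional numbers get no special handling.

    ```shell
    # natural_sort: hypothetical helper, not a standard utility.
    # Reads lines on stdin and writes them in natural order on stdout.
    # Assumes no tabs in the input and digit runs of at most 20 digits.
    natural_sort() {
        awk '{
            rest = $0
            key = ""
            # Copy the line into key, left-padding each digit run to 20 digits.
            while (match(rest, /[0-9]+/)) {
                pad = substr("00000000000000000000", 1, 20 - RLENGTH)
                key = key substr(rest, 1, RSTART - 1) pad substr(rest, RSTART, RLENGTH)
                rest = substr(rest, RSTART + RLENGTH)
            }
            # Emit the padded key, a tab, then the original line.
            print key rest "\t" $0
        }' | LC_ALL=C sort | cut -f2-
    }
    ```

    It could then be used as, for example, seq 100 | shuf | sed 's/^/E/' | natural_sort. Sorting in the C locale keeps the byte-wise comparison of the padded keys predictable; a locale-aware ordering of the non-digit parts would need a different approach.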