sort command not working as expected for uppercase letters followed by underscore


I'm sorting a list of usernames. When the letters are lowercase, the sort command works as expected.

Expected and actual output for lowercase:

n
n_123
na
na_123

When the characters are uppercase and followed by an underscore, things get weird.

Expected output for uppercase:

N
N_123
NA
NA_123

Actual output for uppercase using sort:

N
NA
NA_123
N_123

I thought I'd be able to solve this using

env LC_COLLATE=C sort "$file"

but no dice.

Actual output using env LC_COLLATE=C sort:

N
NA
NA_123
N_123

I'm running GNU bash, version 4.4.12(1)-release (x86_64-apple-darwin16.3.0) on Mac OS X 10.12.3

Any help would be much appreciated.


Solution

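  • Underscore is ASCII 95, which comes after all the uppercase letters (A-Z, i.e. 65-90). In a byte-order sort, every uppercase letter therefore sorts before _, so NA_123 precedes N_123. The C locale sorts in exactly this byte order, which is why setting LC_COLLATE=C did not change your result.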

    If you want to treat _ as a field delimiter and sort on the first field only, use -t _ with -k1,1 to get your expected output:

    sort -t _ -k1,1 file
    N
    N_123
    NA
    NA_123
    

    The reason your sort command worked with lowercase letters is that lowercase letters (97-122) come after _ (95).
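
To check the explanation above yourself, here is a minimal sketch. It assumes a sort that supports -t and -k (GNU and BSD sort both do) and uses a throwaway file from mktemp; the variable names are illustrative, and the data is the list from the question.

```shell
# Code points of 'A', '_' and 'a': '_' sits between the two letter ranges.
codes=$(printf '%d %d %d' "'A" "'_" "'a")
echo "$codes"

# Sample data from the question, in scrambled order (temp file is illustrative).
tmp=$(mktemp)
printf '%s\n' NA_123 N_123 NA N > "$tmp"

# Plain byte-order sort: 'A' (65) < '_' (95), so NA_123 lands before N_123.
plain=$(LC_ALL=C sort "$tmp")

# Split on '_' and compare only the first field; lines with equal keys
# fall back to a whole-line comparison, which puts N before N_123.
keyed=$(LC_ALL=C sort -t _ -k1,1 "$tmp")

printf '%s\n' "$plain" '---' "$keyed"
rm -f "$tmp"
```

LC_ALL=C is used here rather than LC_COLLATE=C because LC_ALL, when set, overrides the individual LC_* variables, so it makes the comparison rule unambiguous for the demonstration.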