bash shell sorting posix locale

How to get the `sort` shell command to compare raw bytes?


It seems like the POSIX sort command-line utility does some fancy locale-based shenanigans to compare the given strings.

I scanned the man page but could not find a way to get it to use the raw byte values instead. Is there a way to get sort (I have the GNU coreutils version) to behave like qsort(array_of_my_strings, N, strcmp) would in C? Solutions using a tool other than sort would be fine too.

For demonstration, I currently get:

printf "\xC3\xBC\n\x76\n" | sort
ü
v

because the German umlaut ü seems to be compared as u, which comes before v, despite \xC3 being larger than \x76.

What I want is:

printf "\xC3\xBC\n\x76\n" | sort --raw-bytes-please
v
ü

Solution

  • Collation order and (multibyte) character type are determined by your locale. The locale that disables multibyte and locale-aware behavior is named C.

    Thus:

    LC_COLLATE=C LC_CTYPE=C sort
    

    ...will set only the character type and the collation order. This assumes LC_ALL isn't set; if it is, it takes precedence and both variables are ignored.


    As a big hammer, you can also use:

    LC_ALL=C sort
    

    albeit with side effects, such as error messages being printed as the strings originally written by sort's developers, with no translation tables in effect.
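As a rough sketch of both approaches applied to the input from the question: the function names bytesort and bytesort_narrow are hypothetical helpers made up for this illustration, not part of coreutils. The input is written with octal escapes because \x escapes in printf are a shell extension rather than something POSIX guarantees.

```shell
#!/bin/sh

# Hypothetical helper: sort by raw byte value using the "big hammer".
# LC_ALL overrides every other LC_* variable for this one command.
bytesort() {
    LC_ALL=C sort "$@"
}

# Hypothetical helper: change only collation and character type.
# LC_ALL is unset inside a subshell so that it cannot override the two
# variables; the caller's environment is left untouched.
bytesort_narrow() {
    ( unset LC_ALL; LC_COLLATE=C LC_CTYPE=C sort "$@" )
}

# \303\274 is the UTF-8 encoding of "ü" (0xC3 0xBC); "v" is 0x76.
# With byte-wise comparison, "v" is printed first.
printf '\303\274\nv\n' | bytesort
printf '\303\274\nv\n' | bytesort_narrow
```

The narrow variant keeps messages in your own language (LC_MESSAGES is left alone), at the cost of having to make sure LC_ALL is out of the way first.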