Today I ran into a problem sorting a file with the Linux sort command. When I set LANG=En_US, the result is what I expect, but with LANG=en_US the result is strange. Here are the commands I ran and their output:
[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ cat dd.dat
23 340_guard 16
23 340_guard 17
23 340_guard 18
23 360_guard... 16
23 360_guard 16
23 360_guard... 17
23 360_guard... 18
[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ LANG=En_US sort dd.dat
23 340_guard 16
23 340_guard 17
23 340_guard 18
23 360_guard 16
23 360_guard... 16
23 360_guard... 17
23 360_guard... 18
[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ LANG=en_US sort dd.dat
23 340_guard 16
23 340_guard 17
23 340_guard 18
23 360_guard... 16
23 360_guard 16 (why does this line appear here?)
23 360_guard... 17
23 360_guard... 18
The raw format of the rows in this file looks like this:
2^E3^F360_guard^E...^I16^Ee^E17/18^I63776769$
2^E3^F360_guard^E^I16^Ee^E17/18^I63776769$
2^E3^F360_guard^E...^I17^Ei^E0^I63776771$
2^E3^F360_guard^E...^I18^Ei^E1^I63776773$
^E is '\x05', ^F is '\x06', ^I is tab, and $ is '\n'.
Thanks in advance.
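The en_US locale uses a smarter collation algorithm that largely ignores punctuation such as those runs of dots, much as a person sorting by hand would. With the dots ignored, the rows are effectively ordered by the trailing number, which is why the row without dots lands between the "... 16" and "... 17" rows. Locale names are case-sensitive, and there is no En_US locale, so LANG=En_US falls back to the default locale (probably C), which compares lines byte by byte.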
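
If the byte-by-byte order is what you want, it is safer to request it explicitly than to rely on a misspelled locale name falling back. A minimal sketch, assuming a POSIX shell with sort and mktemp available; the temporary file and its three rows are a hypothetical stand-in for dd.dat, with plain spaces in place of the control-character separators:

```shell
# Hypothetical sample: three rows modelled on the question, space-separated.
tmp=$(mktemp)
printf '%s\n' \
  '23 360_guard... 16' \
  '23 360_guard 16' \
  '23 360_guard... 17' > "$tmp"

# LC_ALL overrides LANG and all other LC_* variables.
# In the C locale, sort compares raw bytes, so ' ' (0x20) sorts before '.' (0x2e).
sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

Here the row without dots comes first, matching the LANG=En_US output in the question. Running locale -a lists the locale names actually installed on a machine, which is a quick way to check whether a given spelling exists.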