
What is the difference between LANG=En_US and LANG=en_US when sorting?


Today I ran into a problem sorting a file with the Linux sort command. When I set LANG=En_US, the result is what I expect, but with LANG=en_US the result is strange. Here are the commands I ran and their output:

[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ cat dd.dat                 
23 340_guard    16                                                                                                        
23 340_guard    17                                                                                                        
23 340_guard    18                                                                                                        
23 360_guard... 16                                                                                                      
23 360_guard    16                                                                                                        
23 360_guard... 17                                                                                                      
23 360_guard... 18              

[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ LANG=En_US sort dd.dat     
23 340_guard    16                                                                                                        
23 340_guard    17                                                                                                        
23 340_guard    18                                                                                                        
23 360_guard    16                                                                                                        
23 360_guard... 16                                                                                                      
23 360_guard... 17                                                                                                      
23 360_guard... 18                                 

[work@xx:/data1/muce_temp/datamarts/reduce_result_file/302/1d/201212260000]$ LANG=en_US sort dd.dat     
23 340_guard    16                                                                                                        
23 340_guard    17                                                                                                        
23 340_guard    18                                                                                                        
23 360_guard... 16                                                                                                      
23 360_guard    16          (why does this line appear here?)
23 360_guard... 17                                                                                                      
23 360_guard... 18      

With control characters made visible, the rows in this file look like this:

2^E3^F360_guard^E...^I16^Ee^E17/18^I63776769$
2^E3^F360_guard^E^I16^Ee^E17/18^I63776769$
2^E3^F360_guard^E...^I17^Ei^E0^I63776771$
2^E3^F360_guard^E...^I18^Ei^E1^I63776773$

Here ^E is '\x05', ^F is '\x06', ^I is a tab, and $ marks the newline '\n'.

Thanks in advance.


Solution

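  • Under the en_US locale, sort compares lines using that locale's collation rules rather than raw byte values. Those rules evidently give the runs of dots little or no weight, much as punctuation is usually ignored when sorting by hand, so the row without dots ends up among the rows that have them. Locale names are case-sensitive and there is no En_US locale, so LANG=En_US cannot be applied and sort falls back to the default locale (probably C), which compares lines byte by byte. In byte order the tab (0x09) sorts before the dot (0x2E), which gives the result you expected.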
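
A minimal sketch of how to get the byte-order result on purpose instead of relying on a misspelled locale name. It rebuilds two of the rows from the question in a temporary file (the temporary file and the variable names are illustrative, and the last numeric field is left out), then sorts them with LC_ALL=C, which takes precedence over LANG. The final command lists any installed locale whose name matches en_US in any letter case, which is one way to confirm that no En_US exists on a given machine.

```shell
# Rebuild two rows from the question, including the \x05 and \x06 bytes.
# (Illustrative data only; the last numeric field is omitted.)
tmp=$(mktemp)
printf '2\0053\006360_guard\005...\t16\n' >  "$tmp"
printf '2\0053\006360_guard\005\t16\n'    >> "$tmp"

# LC_ALL overrides LANG and the other LC_* variables, so this forces
# plain byte-by-byte comparison whatever the surrounding environment is.
c_sorted=$(LC_ALL=C sort "$tmp")

# Show the result with control characters and tabs made visible.
printf '%s\n' "$c_sorted" | cat -vt
# 2^E3^F360_guard^E^I16
# 2^E3^F360_guard^E...^I16

# List installed locales spelled like en_US, ignoring letter case.
locale -a 2>/dev/null | grep -i '^en_us' || echo "no en_US locale found"

rm -f "$tmp"
```

For the original file, the equivalent command would be `LC_ALL=C sort dd.dat`. Setting LC_ALL rather than LANG matters when LC_COLLATE or LC_ALL is already set in the environment, since either of those would take precedence over LANG.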