arrays ruby sorting alphanumeric

Ruby alphanumeric sort not working as expected


Given the following array:

y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]
=> ["A1", "A2", "B5", "B12", "A6", "A8", "B10", "B3", "B4", "B8"]

I expect the sorted array to be:

=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]

Using the following (vanilla) sort, I get:

irb(main):2557:0> y.sort{|a,b| puts "%s <=> %s = %s\n" % [a, b, a <=> b]; a <=> b}
A1 <=> A8 = -1
A8 <=> B8 = -1
A2 <=> A8 = -1
B5 <=> A8 = 1
B4 <=> A8 = 1
B3 <=> A8 = 1
B10 <=> A8 = 1
B12 <=> A8 = 1
A6 <=> A8 = -1
A1 <=> A2 = -1
A2 <=> A6 = -1
B12 <=> B3 = -1
B3 <=> B8 = -1
B5 <=> B3 = 1
B4 <=> B3 = 1
B10 <=> B3 = -1  # this appears to be wrong, looks like 1 is being compared, not 10.
B12 <=> B10 = 1
B5 <=> B4 = 1
B4 <=> B8 = -1
B5 <=> B8 = -1
=> ["A1", "A2", "A6", "A8", "B10", "B12", "B3", "B4", "B5", "B8"]

...which is obviously not what I desire. I know I can attempt to split on the alpha first and then sort the numerical, but it just seems like I shouldn't have to do that.

Possible big caveat: we're stuck using Ruby 1.8.7 for now :( But even Ruby 2.0.0 is doing the same thing. What am I missing here?

Suggestions?


Solution

  • You are sorting strings. Strings are sorted like strings, not like numbers. If you want to sort like numbers, then you should sort numbers, not strings. The string 'B10' is lexicographically smaller than the string 'B3' because strings are compared character by character, and '1' sorts before '3'. That is not unique to Ruby, or even to programming: it is how lexicographic sorting of text works pretty much everywhere, in programming, databases, lexicons, dictionaries, phonebooks, etc.

    You should split your strings into their numerical and non-numerical components, and convert the numerical components to numbers. Arrays are also compared lexicographically, element by element, so sorting by the resulting arrays will end up sorting exactly right:

    y.sort_by do |s| # use `sort_by` for a keyed sort, not `sort`
      # split numeric runs from non-numeric runs: "B10" becomes ["B", "10"]
      s.split(/(\d+)/).map do |part|
        # parse numeric parts as decimal integers, leave everything else as-is
        begin Integer(part, 10); rescue ArgumentError; part end
      end
    end
    #=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]
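
To make the difference between the two comparisons concrete, here is a minimal sketch that puts the key-building step into a helper. The name `natural_key` is hypothetical and not part of the answer above. It also swaps the `Integer(s, 10)` call for a regex check plus `to_i`, which is my own substitution: it avoids using exceptions for control flow and does not rely on the two-argument form of `Integer`, which may matter on older Rubies such as 1.8.7.

```ruby
# Hypothetical helper (not from the original answer): builds a sort key by
# splitting a string into alternating non-digit and digit runs.
def natural_key(str)
  str.split(/(\d+)/).map do |part|
    # Digit-only runs become Integers; everything else stays a String.
    part =~ /\A\d+\z/ ? part.to_i : part
  end
end

y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]

# Plain strings compare character by character, so "1" sorts before "3".
string_cmp = "B10" <=> "B3"                           # => -1

# Arrays compare element by element, so 10 and 3 are compared as numbers.
key_cmp = natural_key("B10") <=> natural_key("B3")    # => 1

sorted = y.sort_by { |s| natural_key(s) }
# => ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]
```

Because the split always produces a string first (possibly an empty one) and then alternates between strings and integers, two keys only ever compare a string with a string or an integer with an integer at the same position.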