Sorting of strings with fixed length

I want to sort Integers. Therefore I interpret them as binary strings of length 32. Now I am able to apply bucketsort for each component. Here is my implementation in java:

public static void sort(List<Integer> numbers)
{
    Queue<Integer> tmp =new LinkedList<Integer>();
    for (int i = 0; i < numbers.size(); i++) {
        tmp.offer(numbers.get(i));
    }

    List<Integer>[] bucket = new ArrayList[2];
    for (int i = 0; i < bucket.length; i++) {
        bucket[i] = new ArrayList<Integer>();
    }

    for (int k = 32; k > 0; k--) {
        // Clear
        for (int i = 0; i < bucket.length; i++) {
            bucket[i].clear();
        }

        while (!tmp.isEmpty()) {
            Integer firstel =tmp.element();
            String el =String.valueOf(firstel);

            if (el.charAt(k - 1) == 0) {
                bucket[0].add(firstel);
            } else {
                bucket[1].add(firstel);
            }
            tmp.remove();
        }

        for (int i = 0; i < bucket.length; i++) {
            for (Integer j : bucket[i]) {
                tmp.add(j);
            }
        }
    }
}

I am not sure, whether my code deals right with these binary strings. Do I have to convert the integers from the list numbers to binary strings? Note: It is only for practice. There is no deeper sense concerning time complexity.

Solution

Let's first rewrite this a bit, because there is no need for Queues, or a tmp arraylist, etc. Step one, a stub class that lets us write less code:

private class Numlist extends ArrayList<Integer> {
  Numlist() { super(); }
}

Done, we no longer need to write ArrayList<Integer> everywhere.

Now, java does "autoboxing" so anything that you store Integers in can do int, and vice versa. That's handy, because we can throw away all that string stuff. We don't need it if we just care about bit comparison:

public void sort(Numlist numbers) {
  // No copying `numbers` to a `tmp` queue: just work with it directly.
  Numlist zeroes, ones;

  for (int k = 1; k < 32; k++) {
    // Build this step's masking value, which we can use to
    // find the value of individual bits by using bitwise AND.
    // Also note that we _know_ this is a safe integer operation.
    mask = (int) Math.pow(2,k);

    // just allocate new lists; it's about as fast as clearing.
    zeroes = new Numlist();
    ones = new Numlist();

    // "scatter": now we empty the numbers list one element at a
    //  time, and then we'll fill it back up after we emptied it.
    while (!numbers.isEmpty()) {
      int e = numbers.remove(0);

      if ((e & mask) == mask) {
        ones.add(e);
      } else {
        zeroes.add(e);
      }        
    }

    // "gather"
    for (int i: zeroes) { numbers.add(i); }
    for (int i: ones) { numbers.add(i); }      
  }
}

And with that rewrite, things work. We removed a lot of verbosity which makes it easier to reason about what the code does, and we removed the whole "int to string to substring, then char comparison" which means there's a lot less to go wrong.

In terms of testing, add your function into the following class:

import java.lang.Math;
import java.util.ArrayList;

public class Test {

  // private class goes here

  public static void main(String[] argv) { new Test(); }

  public Test() {
    Numlist list = new Numlist();
    list.add(10);
    list.add(77810);
    list.add(4);
    list.add(100);
    list.add(1);
    list.add(290387423);
    list.add(23423);
    sort(list);
    System.out.println(list);
  }

  // function goes here
}

And done: javac will happily compile it, and executing should yield [1, 4, 10, 100, 23423, 77810, 290387423]

You'll also note we're not using for (int k = 31; k > 0; k--) but we're using for (int k=1; k<32; k++)... why? Does it make a difference?

it makes a huge difference

By running our mask from b000...001 to b100...000 we guarantee that despite "picking values back out", their relative "less than the current bit" ordering is preserved.

If we run our masking the other way, from b1000...000 to b000...001, then at every step we're undoing the ordering we just set up, and the result is not a sorted list at all: [1, 4, 100, 77810, 10, 290387423, 23423]

** edit **: why does masking work?

The integer types byte, char, int and long are just bit patterns of 1, 2, 4, and 8 bytes, respectively, so all of these are already "binary numbers", we just need a way to access individual bits, which we can do using bitwise masking.

In order to mask off "everything but a specific bit", you can take the bitwise AND of some bit pattern, and a pattern in which only a single bit is set, which we can create really easily since these are just numbers that are "powers of 2".

For instance, to check the bits in the number 23, we can use the following checks:

     23  &   1     2    4      8     16    32  

b0    1  &  1=1   0=0   0=0   0=0   0=0   0=0
b1    1  &  0=0   1=1   0=0   0=0   0=0   0=0
b2    1  &  0=0   0=0   1=1   0=0   0=0   0=0
b3    0  &  0=0   0=0   0=0   1=0   0=0   0=0
b4    1  &  0=0   0=0   0=0   0=0   1=1   0=0
b5    0  &  0=0   0=0   0=0   0=0   0=0   1=0

Here we can see the number 23, binary 10111, and the result of masking by each power of two: 23 & 1 yields 1, so we know the first bit is set. We see that 23 & 2 yields 2, so we know the second bit is set. The same for 4, but 23 & 8 yields 0. We know that the fourth bit is not set.

So we can check any length bit pattern by using bitwise AND: if (number & mask) == mask we know that the bit for that mask is set. If the result is 0 instead, we know the bit was not set.

Also, note that & is not the same as &&: & is a bitwise AND operation, performing an AND for every single bit between the left and right hand side of the operator. The && operator is logical AND, requiring the left and right sides to boolean values. The logic AND is effectively "a single AND", whereas the bitwise AND is "as many AND operations as there are bits to test".