Search code examples
performanceprogramming-languagesubuntu-12.04execution-timetowers-of-hanoi

Unexpected performance numbers using Towers of Henoi solution


Here is a performance puzzler: I'm running on a Intel Celeron 1.8GHz, 3GB RAM, Ubuntu 12.04 32-bit box. I wrote the Towers of Henoi recursive solution in C, Python & Java. I ran the same for n = 25 and was shocked to see the following times:

performace numbers

I assumed that Java would perform better than Python but it looks like Java is in orders of magnitude slower than C and Python on this box. To neutralize any code level performance influence-rs, I've kept the code to it's most elementary form. Here are the 3 solutions:

henoi.c

#include <stdio.h>

void move(int n, char src, char intr, char dest) {
    if (n > 0) {
        move(n-1, src, dest, intr);
        printf("%c -> %c\n", src, dest);
        move(n-1, intr, src, dest);
     }
}

int main() {
    move(25, 'A', 'B', 'C');
    return 0;
}

henoi.py

def move(n, src, intr, dest):
    if n > 0:
        move(n-1, src, dest, intr)
        print '%s -> %s' % (src, dest)
        move(n-1, intr, src, dest)

move(n=25, src='A', intr='B', dest='C') 

Henoi.java

public class Henoi {

    static void move(int n, char src, char intr, char dest) {
        if (n > 0) {
            move(n-1, src, dest, intr);
            System.out.println(src + " -> " + dest);
            move(n-1, intr, src, dest);
        }
    }

    public static void main(String[] args) {
        move(25, 'A', 'B', 'C');
    }
}

I'm using gcc version 4.6.3, Python 2.7.3, and Sun Java 1.6.0_24.

Can someone please help me understand the above figures ???

Erratum: I noticed much later the I've spelt "Hanoi" as "Henoi" everywhere. Pardon me !


Solution

  • You are almost exlusively benchmarking the performance of the standard output stream. Java tries to flush everything to stdout as soon as possible. So with a bit of extra buffering I got the runtime down to 10 seconds:

    public class Henoi {
    
        static final PrintStream bout = new PrintStream(
                          new BufferedOutputStream( System.out ) );
    
        static void move(int n, char src, char intr, char dest) {
            if (n > 0) {
                move(n-1, src, dest, intr);
                bout.println(src + " -> " + dest);
                move(n-1, intr, src, dest);
            }
        }
    
        public static void main(String[] args) {
            move(25, 'A', 'B', 'C');
            bout.flush();
        }
    }
    

    You now need to flush the buffer before the program exits to make sure that everything was written.

    The other aspect that makes this Java program slow is that you create new String objects for every line which also need to be converted fom 16 bit chars to 8 bit bytes while you already use 8 bit chars in C. When you would avoid this you get another 10 times performance increase.

    public class Henoi {
    
        static final byte[] barr = "x -> y\n".getBytes();
        static final OutputStream bout = new BufferedOutputStream( System.out );
        static void move(int n, char src, char intr, char dest) throws IOException {
            if (n > 0) {
                move(n-1, src, dest, intr);
                barr[ 0 ] = (byte) src;
                barr[ 5 ] = (byte) dest;
                bout.write( barr, 0, barr.length );
                // System.out.println(src + " -> " + dest);
                move(n-1, intr, src, dest);
            }
        }
    
        public static void main(String[] args) {
            try {
                move(25, 'A', 'B', 'C');
                bout.flush();
            } catch( IOException ex ) {
                ex.printStackTrace();
            }
        }
    }
    

    It now finishes in 920 Milliseconds on my PC (writing to RAM disk) and I guess with a little more tunig (there is still lots of redundant byte array copying) it could finish in less than half of that. For comparision, your original code took almost exactly 2 minutes in the same environment (i7 4.5GHz, Windows 7 x64, Java 1.7). The unaltered C version compiled with Visual C++ 2010 x64 and /Ox (all optimizations on) takes 6.650 seconds.

    EDIT: Here is the C version with similar changes which takes 950ms to run now:

    #include <stdio.h>
    
    char *str;
    
    void move(int n, char src, char intr, char dest) {
        if (n > 0) {
            move(n-1, src, dest, intr);
            str[ 0 ] = src;
            str[ 5 ] = dest;
            _fwrite_nolock( str, 7, 1, stdout );
            // printf("%c -> %c\n", src, dest);
            move(n-1, intr, src, dest);
         }
    }
    
    int main() {
        char mem[] = "x -> 5\n";
        str = mem;
        setvbuf( stdout, NULL, _IOFBF, 8192 );
        _setmode( 1 /* stdout */, 0x8000 /* _O_BINARY */ );
        move(25, 'A', 'B', 'C');
        _fflush_nolock( stdout );
        return 0;
    }
    

    The _setmode() call avoids the line feed translation that would otherwise be aplied on Windows (\n->\r\n) and thus cause a larger output. _fwrite_nolock is the same as fwrite but without locking and about twice as fast in this scanario.