I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?
I have tried:
But I couldn't do any better than a 216-entry lookup table with a manually-unrolled loop (about 27% CPU.)
How else might this be optimized for Java?
Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.
If you are on a recent x86 CPU there is an instruction for this, popcnt.
In recent versions of Java, Long.bitCount() uses this instruction. Just use -XX:+UsePopCountInstruction (this is the default in recent versions)
However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7063674