How can JVM implementations like Jython and JRuby beat their native counterparts?

I was watching this video here, where Robert Nicholson discusses P8, an implementation of PHP on the JVM. At some point he mentions that they aim to surpass native PHP in performance some time in the future.

He mentions JRuby and Jython, which started out slower than their native counterparts but eventually surpassed them. Quercus, another PHP interpreter on the JVM, claims to be 4x faster than mod_php and is also worthy of note.

Does that mean that the general idea that the JVM is slower than C is wrong, or are there flaws in the original C implementations?

Solution

> Does that mean that the general idea that the JVM is slower than C is wrong, or are there flaws in the original C implementations?

A bit of both

The JVM has been around for a long time and has made significant progress in efficiency. Its garbage collection, JIT compilation, caching and other areas are more advanced than those in 'reference' implementations such as PHP's.

Anyone taking a look under the hood of PHP will understand why efficiency gains are easy to achieve.

~~I am personally doubtful that the JVM can outperform CPython, however~~ ... but I could be wrong ... and I am: this is down to the JVM GC being faster, and the same applies to IronPython. Further performance improvements may come from not relying on the C call stack, as in Stackless Python. The Jython site states:

> Jython is approximately as fast as CPython--sometimes faster, sometimes slower. Because most JVMs--certainly the fastest ones--do long running, hot code will run faster over time.

Which I can appreciate as fact: the JVM will approach C performance levels as its caches fill and so on, which basically negates the overhead of the higher-level aspects of the VM implementation code (a large part of which is written in C anyway).
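To see the "hot code will run faster over time" effect for yourself, a rough sketch like the following can help. It times the same loop over several rounds so you can compare early rounds with later ones. The class name `WarmupSketch`, the `work` function and the round sizes are all made up for illustration. This is not a rigorous benchmark, and how much (if any) speed-up appears depends on the JVM and its settings.

```java
public class WarmupSketch {
    // Accumulated results, kept in a field so the work cannot be discarded as dead code.
    static long checksum = 0;

    // Hypothetical hot function: running sum of squares modulo a prime.
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc = (acc + (long) i * i) % 1_000_003L;
        }
        return acc;
    }

    // Runs the hot function `rounds` times and returns the elapsed nanoseconds of each round.
    static long[] timeRounds(int rounds, int n) {
        long[] nanos = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            checksum += work(n);
            nanos[r] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Round count and loop size are arbitrary assumptions; adjust to taste.
        long[] nanos = timeRounds(10, 2_000_000);
        for (int r = 0; r < nanos.length; r++) {
            System.out.printf("round %d: %.2f ms%n", r, nanos[r] / 1e6);
        }
        System.out.println("checksum: " + checksum);
    }
}
```

Running it with `java -Xint WarmupSketch` (interpreter only, on HotSpot) versus plain `java WarmupSketch` gives a crude feel for how much of the speed comes from JIT compilation rather than from the bytecode itself.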

Many interpreted languages, such as PHP and Python, are just bridges to equivalent C calls that dive into machine code. In the JVM, the JIT compiler performs a similar function by reducing the bytecode to machine-code equivalents. Eventually, intermediate representations such as the high-level syntax and bytecode are usually reduced to CPU operations that run at C speed or faster anyway, so it is all the same, just with more intermediate steps, which only affect the latency to full efficiency when loading new code. There comes a point, once the code is in RAM, where you ask "what is the real difference?" The answer is only the process that gets it there and the final representation, which determines the speed of stack unwinding, garbage collection algorithms, register usage and logic representation such as arithmetic.