I've noticed surprising difference between performance of Itarable.sum()
and direct for loop with manual sum. Consider this:
import kotlin.system.measureTimeMillis
fun main(args: Array<String>) {
var sink = 0;
repeat(5) {
println(measureTimeMillis {
var sum = 0
for (i in 1..10_000_000) {
sum += i
}
sink += sum
})
}
repeat(5) {
println(measureTimeMillis {
sink += (1..10_000_000).sum()
})
}
}
Surprisingly, using Iterable.sum()
is up to 10 times slower,
compared to the code that is almost identical to sum() implementation.
Why is that?
Update:
When I target js, then sum() is only slightly slower.
measureTimeMillis()
can be defined as:
import kotlin.js.Date
public inline fun measureTimeMillis(block: () -> Unit): Double {
val start = Date.now()
block()
return Date.now() - start
}
Update2:
On same linux machine, jvm sum() is even slower than js. Here are results for 100_000_000 iterations for jvm (Oracle jdk9) and js (latest chrome):
105 // jvm raw loop
76 // jvm raw loop (jit?)
75 // jvm raw loop (jit?)
75 // jvm raw loop (jit?)
70 // jvm raw loop (jit?)
633 // jvm sum()
431 // jvm sum()
562 // jvm sum()
327 // jvm sum() (jit?)
332 // jvm sum() (jit?)
110 // js raw loop
108 // js raw loop
232 // js raw loop
227 // js raw loop
227 // js raw loop
321 // js sum()
284 // js sum()
264 // js sum()
266 // js sum()
265 // js sum()
So, on same machine, jvm seems to be slower than js when using sum()
. Yet another surprise.
Clearly, we're comparing super-optimized tight loops here. I'm seeing quite stable results across repetitions for the "manual sum" and wild variance in the "built-in" case. This indicates GC activity.
Upon launching VisualVM and using its VisualGC plugin, I confirmed that there's no GC activity during the manual sum computation, but a lot of it in the built-in case.
Looking at the generated bytecode, the difference becomes apparent: the idiom for (i in 1..range) { ... }
compiles directly into a counted loop. This is actually documented:
Integral type ranges (
IntRange
,LongRange
,CharRange
) have an extra feature: they can be iterated over. The compiler takes care of converting this analogously to Java's indexed for-loop, without extra overhead.
Unfortunately, the same optimization doesn't apply to the extension function Iterable.sum()
because it must work for any Iterable
. The compiler could see what's going on and introduce another intrinsic, which would simply convert the whole thing into the resulting sum without computation, or use a direct formula if the range bounds aren't hardcoded.
JavaScript is on a similar footing here because it, too, has a powerful JIT compiler. I can't comment anything specific, but it most probably avoids allocation in the hot loop.