I have two versions of some code to diagonally traverse an MxN matrix. Apparently, even with the same (bad) traversal logic, the version which uses Java Streams runs 5x slower than the one which doesn't.
Version 1: Using for loops:
public int[] findDiagonalOrder(int[][] mat) {
    int M = mat.length;
    int N = mat[0].length;
    int[] traversal = new int[M * N];
    int s = 0;
    for (int sum = 0; sum < M + N; sum++) {         // sum = row + col; one anti-diagonal per value
        for (int k = 0; k <= sum; k++) {
            int row = (sum % 2 == 0) ? sum - k : k; // flip direction on alternate diagonals
            if (row < M && (sum - row) < N)
                traversal[s++] = mat[row][sum - row];
        }
    }
    return traversal;
}
Version 2: Using Java Streams:
public int[] findDiagonalOrder(int[][] matrix) {
    int M = matrix.length;
    int N = matrix[0].length;
    return IntStream.range(0, M + N)                            // sum = row + col
            .flatMap(sum -> IntStream.rangeClosed(0, sum)
                    .filter(i -> {
                        int row = (sum % 2 == 0) ? (sum - i) : i;  // same zigzag rule as Version 1
                        return row < M && (sum - row) < N;
                    })
                    .map(i -> (sum % 2 == 0) ? matrix[sum - i][i] : matrix[i][sum - i]))
            .toArray();
}
Version #1 runs in approximately 353 msec, while Version #2 runs in approximately 1500 msec. I wanted to understand: what is causing this degradation in performance? I'm not even using boxing/unboxing, which might have had an impact.
PS: I understand there is a better way for matrix traversal in O(M*N), but my question is about Streams.
I wanted to understand: what is causing this degradation in performance?
The short answer is probably stream overheads.
Performance is not the primary goal of Java streams. In current versions of Java, the compilers are not (yet) able to transform stream-based code into native code that performs as well as classic loops. In the future this may change. However, significant R&D effort would be needed to make that happen, and (AFAIK) this is not currently a priority for the Java team.
So the simplistic "corollary" is that if performance [1] is the most important requirement for a given code fragment [2], you shouldn't be using streams in the solution.
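To make "stream overheads" a bit more concrete for the pipeline in the question: the flatMap step builds and tears down a fresh inner IntStream (plus the lambdas that capture sum) for every anti-diagonal, work that the nested loops in Version 1 simply don't do. Here is a small illustrative sketch (not a benchmark; the 1000x1000 size is made up) that just counts those inner pipelines:

import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class StreamOverheadSketch {
    public static void main(String[] args) {
        int M = 1000, N = 1000;              // assumed dimensions, for illustration only
        AtomicLong innerStreams = new AtomicLong();

        IntStream.range(0, M + N)
                 .flatMap(sum -> {
                     innerStreams.incrementAndGet();   // one inner pipeline per anti-diagonal
                     return IntStream.rangeClosed(0, sum);
                 })
                 .count();

        // Prints 2000: two thousand short-lived stream objects (and lambda
        // instances) whose allocation and setup the plain loops avoid entirely.
        System.out.println("inner streams created: " + innerStreams.get());
    }
}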
I'm not even using boxing/unboxing, which might have had an impact.
Not explicitly. But it could be happening under the covers. The only way to be absolutely sure would be to analyze all of the native code emitted by the JIT compiler [3]. Profiling your benchmark may also give you the answer ... if the answer is that there is indeed boxing/unboxing behind the scenes.
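For contrast, here is a small sketch (not taken from the question's code; the method names are just for illustration) of where boxing does show up explicitly once you leave the primitive IntStream domain:

import java.util.Comparator;
import java.util.stream.IntStream;

public class BoxingContrast {
    // Stays entirely in the primitive IntStream domain: the pipeline itself
    // creates no Integer objects.
    static long sumPrimitive(int[] data) {
        return IntStream.of(data).asLongStream().sum();
    }

    // Crossing over to Stream<Integer> (here just to use a Comparator) boxes
    // every element via Integer.valueOf -- the kind of hidden cost that only
    // profiling or reading the JIT output would confirm or rule out.
    static int maxBoxed(int[] data) {
        return IntStream.of(data)
                        .boxed()                           // int -> Integer
                        .max(Comparator.naturalOrder())
                        .orElseThrow();
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(sumPrimitive(data));            // 31
        System.out.println(maxBoxed(data));                // 9
    }
}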
Finally, since you haven't shown us your complete benchmark, it is quite possible that your results (353 msec versus 1500 msec) are not representative of how the corresponding code would perform in a production context. Read "How do I write a correct micro-benchmark in Java?" and the various sources that the answers cite.
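For reference, here is a minimal JMH sketch of what a more trustworthy comparison could look like. The class name, matrix size, and iteration counts are assumptions rather than the OP's actual setup; the two benchmark bodies are the question's Version 1 and Version 2:

import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class DiagonalOrderBenchmark {

    @Param({"2000"})          // assumed size; pick whatever matches the real workload
    int size;

    int[][] matrix;

    @Setup
    public void setup() {
        matrix = new int[size][size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                matrix[i][j] = i * size + j;
    }

    @Benchmark
    public int[] loops() {
        // Version 1 from the question
        int M = matrix.length, N = matrix[0].length;
        int[] traversal = new int[M * N];
        int s = 0;
        for (int sum = 0; sum < M + N; sum++) {
            for (int k = 0; k <= sum; k++) {
                int row = (sum % 2 == 0) ? sum - k : k;
                if (row < M && (sum - row) < N)
                    traversal[s++] = matrix[row][sum - row];
            }
        }
        return traversal;     // returning the result keeps the work from being elided
    }

    @Benchmark
    public int[] streams() {
        // Version 2 from the question
        int M = matrix.length, N = matrix[0].length;
        return IntStream.range(0, M + N)
                .flatMap(sum -> IntStream.rangeClosed(0, sum)
                        .filter(i -> {
                            int row = (sum % 2 == 0) ? (sum - i) : i;
                            return row < M && (sum - row) < N;
                        })
                        .map(i -> (sum % 2 == 0) ? matrix[sum - i][i] : matrix[i][sum - i]))
                .toArray();
    }
}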
[1] - Actual performance right now, rather than potential performance in the future.
[2] - However, this is typically NOT the case. Even in a performance-critical application, most of the lines of code in that application don't have a significant impact on overall performance.
[3] - Bytecodes do not give a reliable indication of code performance. Optimization is done by the JIT compiler, not the bytecode compiler. There could potentially be instructions to do boxing and unboxing in the bytecodes that are completely optimized away in the native code.
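As a concrete illustration of that last point (this example is not from the question): the bytecode for the method below contains Integer.valueOf and Integer.intValue calls for the boxing and unboxing, yet HotSpot's JIT can often eliminate the short-lived Integer objects entirely, so the bytecode alone says little about what actually executes.

public class BoxingInBytecode {
    // javac emits boxing/unboxing instructions for 'boxed', but after JIT
    // compilation (escape analysis, scalar replacement) the Integer objects
    // may never be allocated at all.
    static int sumFirst(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            Integer boxed = i;   // Integer.valueOf(i) in the bytecode
            total += boxed;      // boxed.intValue() in the bytecode
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumFirst(10));   // 45
    }
}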