Tags: java, javac, jit, cpu-cache

Is javac failing to optimize my code?


I have been experimenting with CPU cache optimization. As a simple test, I summed a 2048x2048 matrix of integers inside nested loops, first with a sequential index (always stepping to the next integer in memory) and then with long-stride memory access (jumping ahead to the 2048th next integer). I ran each piece of code 1000 times and took the average execution time. The test ran on an Intel Core 2 Duo E4600 at 2.4 GHz with no other background processes that could distort the timings. The results were:

Sequential index: average: 8ms.

Long-stride memory access: average: 43ms.

Here is the source code:

import java.util.Arrays;
import java.util.Random;

public class Main {

    public static void main(String[] args) {
        int [][] matrix = new int[2048][2048];
        
        long tInitial = 0;
        long tFinal = 0;
        
        long [] avgExecTime = new long[1000];
        
        for (int i = 0; i < avgExecTime.length; i++) {
            fillWithRandomNumbers(matrix);
            
            tInitial = System.currentTimeMillis();
            
            sumSequentially(matrix);
            
            tFinal = System.currentTimeMillis();
            
            avgExecTime[i] = tFinal - tInitial;
        }
        
        System.out.println(Arrays.stream(avgExecTime).sum() / avgExecTime.length);
        
        for (int i = 0; i < avgExecTime.length; i++) {
            fillWithRandomNumbers(matrix);
            
            tInitial = System.currentTimeMillis();
            
            sumRandomly(matrix);
            
            tFinal = System.currentTimeMillis();
            
            avgExecTime[i] = tFinal - tInitial;
        }
        
        System.out.println(Arrays.stream(avgExecTime).sum() / avgExecTime.length);
    }
    
    public static void fillWithRandomNumbers(int [][] matrix) {
        Random r = new Random();
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                matrix[i][j] = r.nextInt();
            }
        }
    }
    
    public static long sumSequentially(int [][] matrix) {
        long total = 0;
        
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
            }
        }
        
        return total;
    }
    
    public static long sumRandomly(int [][] matrix) {
        long total = 0;
        
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[j][i];
            }
        }
        
        return total;
    }
}

My questions are:

  1. Why doesn't the Java compiler optimize the method sumRandomly, considering that sumRandomly and sumSequentially produce the same final result?
  2. Is the Java compiler assuming that another thread could be changing my matrix parameter's values while the sum is happening, which would change the result?

I have already tried skipping the methods and doing everything inside main, and the result is almost the same.


Solution

  • As you can see in hexafraction's answer, the kind of optimization you are asking about is not viable. Still, there is one small optimization that I believe the Java compiler does perform, and that is why the times differ.

    I am posting this answer just to make these terms more visible, because I think they are important.

    The array access function

    In Java, every non-primitive variable is a reference: the data is not accessed directly, and it is probably not stored sequentially.

    Your element type is int, which is primitive, but you are storing the values in arrays, which are objects. To read a value like matrix[i][j], the JVM has to perform at least three operations: load the reference to the matrix object, load the reference to the row matrix[i], and finally load the int value at [j].

    As I said, the rows will probably not be stored sequentially, so in most cases the caches will not help much. Looking at your code, we can see two different access patterns, matrix[i][j] and matrix[j][i], where i is always the outer loop index and j the inner one. In the first case a simple optimization is possible: matrix[i] returns the same row reference on every iteration of the j loop, so the compiler can see this and say, "Wait there! I don't need to compute the same result on every iteration. I will cache this!" Your code then effectively becomes:

    public static long sumSequentially(int [][] matrix) {
        long total = 0;
    
        for (int i = 0; i < matrix.length; i++) {
            int[] matrixi = matrix[i]; // the row reference is hoisted out of the inner loop
            for (int j = 0; j < matrixi.length; j++) {
                total += matrixi[j];
            }
        }
    
        return total;
    }
    

    In the second method this is not possible: the j index changes on every iteration of the inner loop, and with it both the first and the second array access change on every iteration.
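    To make that concrete, here is a sketch of mine (not code from the question) that visits the same elements as sumRandomly but with the loops swapped by hand, so the row matrix[j] stays fixed during the inner loop and can be hoisted exactly like in sumSequentially:

```java
public class LoopInterchange {
    // Same elements as sumRandomly (matrix[j][i]), but with the loops
    // interchanged: the row matrix[j] is now constant in the inner loop,
    // so it is loaded once per row and the access within it is sequential.
    public static long sumInterchanged(int[][] matrix) {
        long total = 0;
        for (int j = 0; j < matrix.length; j++) {
            int[] row = matrix[j];           // one outer-array access per row
            for (int i = 0; i < row.length; i++) {
                total += row[i];             // sequential access within the row
            }
        }
        return total;
    }
}
```

    The sum is identical to sumRandomly's for a square matrix; only the visiting order changes, which is precisely what makes the hoisting (and the cache) effective.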

    The cpu cache

    I've made this example in C, because C is a straightforward scenario for CPU caching, which I believe is your main doubt.

    The results were:

    Sequential sum average: 2ms.

    Random sum average: 39ms.

    So, we have almost the same scenario. C stores array data sequentially; for a multi-dimensional matrix (taking char[2][2] as the type) the layout is:

    +--------------+------------+
    | Ram address  | Item index | 
    +--------------+------------+
    | 00000120000  | [0][0]     | 
    +--------------+------------+
    | 00000120001  | [0][1]     | 
    +--------------+------------+
    | 00000120002  | [1][0]     | 
    +--------------+------------+
    | 00000120003  | [1][1]     | 
    +--------------+------------+
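    In other words, the table follows the row-major formula offset = i * columns + j. A minimal sketch of that index calculation (in Java, mirroring the layout C uses; the class name is mine):

```java
public class RowMajor {
    // Flat offset of element [i][j] in row-major order, with 'cols' columns per row.
    static int flatIndex(int i, int j, int cols) {
        return i * cols + j;
    }

    public static void main(String[] args) {
        // Reproduces the table for char[2][2]:
        // [0][0] -> 0, [0][1] -> 1, [1][0] -> 2, [1][1] -> 3
        System.out.println(flatIndex(1, 0, 2)); // prints 2
    }
}
```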
    

    The CPU caches data and instructions so it does not need to go back to RAM for them. When we access data sequentially, the CPU fills the cache with the data adjacent to what we are reading, so the next N values are already cached when we access one. The problem comes when we access value N + 1: the cache line has to be refilled, and another batch of data is moved in from RAM.

    When accessing data the second way, you jump back and forth across RAM addresses. Almost every jump misses the cache, which then has to be refilled, so more memory operations are needed and the process is slower. If you want to know more, I once read a Gallery of Processor Cache Effects.
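    The same effect can be reproduced in Java over a single flat array, which removes the extra reference indirection and leaves only the cache behavior. This is a sketch of mine (timings vary by machine, and this simple loop is not a rigorous benchmark); both calls visit every element exactly once, only the stride differs:

```java
import java.util.Arrays;

public class StrideDemo {
    static final int N = 2048;

    // Sum every element of 'data', visiting it with the given stride:
    // stride 1 walks memory sequentially, stride N touches one element
    // per N-element block before wrapping back to the start.
    static long sumStride(int[] data, int stride) {
        long total = 0;
        for (int start = 0; start < stride; start++) {
            for (int k = start; k < data.length; k += stride) {
                total += data[k];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[N * N];
        Arrays.fill(data, 1);

        long t0 = System.nanoTime();
        long a = sumStride(data, 1);   // cache-friendly, sequential
        long t1 = System.nanoTime();
        long b = sumStride(data, N);   // one element per jump of N ints
        long t2 = System.nanoTime();

        System.out.println("stride 1: " + (t1 - t0) / 1_000_000 + "ms, sum = " + a);
        System.out.println("stride N: " + (t2 - t1) / 1_000_000 + "ms, sum = " + b);
    }
}
```

    Both sums are equal; only the access order, and therefore the cache hit rate, differs.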

    Both languages show similar behavior, but for different reasons, which I guess is why you got confused.