I have a question about concurrent software in Java. The main algorithm of my application has to factor a matrix A into L and U matrices: A = LU. The decompose method pasted below performs Gaussian elimination. The software is designed to work on square matrices with A[0][0] != 0.
Unfortunately, for the algorithm to work correctly, I have to wait until every row has updated its values.
I tried to parallelize this algorithm using a barrier (to wait for every row to be updated and then increment the "rigaCorrente" value), but I don't get a real speedup, even though in the parallel version my processor (2.4 GHz Core 2 Duo P8600) works at 80% of its capacity (instead of the 40-50% of the serial version).
My fear is that I'm running into a false sharing situation. Is the problem related to false sharing or to something else? I know that the JVM performs good optimizations, but it still uses only half the power of the processor. I don't think it's impossible to improve the algorithm!
My serial code:
public void decompose() {
    int n = A.length;
    for (int k = 0; k < n - 1; k++) {           // pivot row k
        for (int i = k + 1; i < n; i++) {       // rows below the pivot
            A[i][k] = A[i][k] / A[k][k];        // multiplier, stored in the L part
            for (int j = k + 1; j < n; j++) {
                A[i][j] = A[i][j] - A[i][k] * A[k][j];   // update the U part
            }
        }
    }
    decomposed = true;
}
My parallel code:
public class Manager {

    private double[][] A;
    private Semaphore main = new Semaphore(0);
    private CyclicBarrier barriera;
    private AtomicInteger index;
    private int rigaCorrente = 0;
    private boolean stop = false;

    public Manager(final double A[][], final int numThr) {
        this.A = A;
        this.index = new AtomicInteger(1);
        barriera = new CyclicBarrier(numThr, new Runnable() {
            @Override
            public void run() {
                rigaCorrente++;
                index = new AtomicInteger(rigaCorrente + 1);
                if (rigaCorrente == A.length - 1) {
                    setStop(true);
                    main.release();
                }
            }
        });
    }
The thread class:
public class Deco implements Runnable {

    private Manager manager;

    public Deco(Manager manager) {
        this.manager = manager;
    }

    @Override
    public void run() {
        double[][] A = manager.getA();
        while (manager.getStop() == false) {
            int i;
            while ((i = manager.getIndex().getAndIncrement()) < A.length) {
                double pivot = A[i][manager.getRigaCorrente()] / A[manager.getRigaCorrente()][manager.getRigaCorrente()];
                for (int k = manager.getRigaCorrente(); k < A.length; k++)
                    A[i][k] = A[i][k] - (pivot * A[manager.getRigaCorrente()][k]);
                A[i][manager.getRigaCorrente()] = pivot;
            }
            manager.acquireBarriera();
        } // Stop
    }
}
Main for parallel code:
package luConcurrent.test;

import java.util.Arrays;
import java.util.Scanner;

import lu.DecompositionLU;
import lu.IO;

public class Starter {

    private static IO io;
    private static DecompositionLU dec;

    public static void main(String[] args) throws Exception {
        io = new IO("//Users//black//Desktop//serie//2500.txt");
        int numThr = 2;
        double[][] A = io.readMatrixFromInputFile();
        double[] b = io.readArrayFromInputFile();
        double[] x;
        dec = new DecompositionLU(A);
        System.out.println("A.length: " + A.length);

        Manager manager = new Manager(A, numThr);
        Thread[] pool = new Thread[numThr];
        for (int i = 0; i < pool.length; i++) {
            pool[i] = new Thread(new Deco(manager));
        }

        long inizio = System.nanoTime();
        for (int i = 0; i < pool.length; i++) {
            pool[i].start();
        }
        manager.getMain().acquire();
        dec.ProgresiveSustitution(b);
        x = dec.RegresiveSustitution(b);
        long fine = System.nanoTime() - inizio;

        System.out.println("Solution is: " + Arrays.toString(x));
        Scanner sc = new Scanner(System.in);
        sc.nextLine();
        System.out.println("Tempo: " + fine);
        sc.close();
    }
}
Results:
1000x1000 Serial: 1154679000 nanoSec
1000x1000 Parallel 2 Threads: 1135663000 nanoSec
1750x1750 Serial: 7502559000 nanoSec
1750x1750 Parallel 2 Threads: 6667129000 nanoSec
4000x4000 Serial: 89851311000 nanoSec
4000x4000 Parallel 2 Threads: 84348616000 nanoSec
I wouldn't jump to the conclusion that false sharing is occurring. The parallel version of the algorithm adds a bunch of overhead that is probably responsible for reducing its performance below what you're expecting.
The serial version simply has three nested loops: an outer loop over k, a middle loop over i, and an inner loop over j. All it does is array access and arithmetic, so this should be quite fast.
The parallel version runs its outer loop over rigaCorrente (current row, if I'm not mistaken) using a CyclicBarrier at each iteration. This adds overhead. A cyclic barrier causes early-arriving threads to wait until the last has arrived. Well, you have only two threads, so the one that arrives first must wait for the second. That's some dead time there. Even if the threads finish at about the same time, there is overhead performing the barrier synchronization. And then one thread must wait while the other runs the barrier action.
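If you want to get a feel for how much the synchronization alone costs on your machine, here is a rough micro-benchmark sketch. The BarrierCost class name and the row count are made up for illustration; it measures only the barrier crossings your outer loop performs once per row, with no elimination work at all:

import java.util.concurrent.CyclicBarrier;

// Rough sketch: measures the cost of two threads meeting at a barrier once per
// "row", which is approximately the synchronization your outer loop pays.
public class BarrierCost {
    public static void main(String[] args) throws Exception {
        final int rows = 4000;                        // e.g. a 4000x4000 matrix
        final CyclicBarrier barriera = new CyclicBarrier(2);
        Runnable worker = new Runnable() {
            @Override
            public void run() {
                try {
                    for (int k = 0; k < rows; k++) {
                        barriera.await();             // no work, just synchronization
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        long elapsed = System.nanoTime() - start;
        System.out.println("approx ns per barrier crossing: " + (elapsed / rows));
    }
}

The result gives a per-row floor for the synchronization cost, independent of the arithmetic.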
The middle loop is over index, which is an AtomicInteger fetched by the getIndex method call. The method call adds overhead, and getAndIncrement adds some contention between the threads.
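One way to cut down on that contention -- just a sketch of the idea, not your code; the CHUNK constant and the eliminateRows name are made up -- is to have each thread claim a block of rows per getAndAdd call instead of one row per getAndIncrement:

// Sketch: claim rows in blocks so the shared AtomicInteger is touched far less
// often. Assumes the counter is reset to rigaCorrente + 1 before each pass,
// as your barrier action already does.
private static final int CHUNK = 16;

private void eliminateRows(double[][] A, int rigaCorrente, AtomicInteger nextRow) {
    int start;
    while ((start = nextRow.getAndAdd(CHUNK)) < A.length) {
        int end = Math.min(start + CHUNK, A.length);
        for (int i = start; i < end; i++) {
            double pivot = A[i][rigaCorrente] / A[rigaCorrente][rigaCorrente];
            for (int k = rigaCorrente + 1; k < A.length; k++) {
                A[i][k] = A[i][k] - pivot * A[rigaCorrente][k];
            }
            A[i][rigaCorrente] = pivot;
        }
    }
}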
The inner loop (confusingly over k instead of j as in the serial version) has a method call to getRigaCorrente inside of it. Sometimes -- but only sometimes -- the JVM can inline method calls. I don't see the implementation of getRigaCorrente here, but since rigaCorrente is a private instance variable of manager that isn't volatile, and it's read and written by multiple threads, perhaps this method is synchronized. That would add even more overhead.
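Whatever getRigaCorrente does, its value can't change in the middle of a pass (only the barrier action changes it), so you can read it once per pass and keep it in a local. A sketch of the inner part of Deco.run() with just that change:

// Sketch: read the shared row index once per pass instead of on every array access.
int rc = manager.getRigaCorrente();        // one (possibly synchronized) read
int i;
while ((i = manager.getIndex().getAndIncrement()) < A.length) {
    double pivot = A[i][rc] / A[rc][rc];
    for (int k = rc + 1; k < A.length; k++) {
        A[i][k] = A[i][k] - pivot * A[rc][k];
    }
    A[i][rc] = pivot;
}
manager.acquireBarriera();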
The problem here is that the threads have to interact with shared state quite a bit during the run. This adds overhead and contention. I'd suggest trying to find a way to partition the work among threads in advance, then tell each thread what work it needs to do, and then have them all run independently.
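For example -- a sketch under assumptions, not a drop-in for your classes; StaticDeco and its fields are names I made up -- give each thread a fixed set of rows, decided by the row index modulo the thread count. Then the only shared coordination left is one barrier per pivot step, and the hot loops touch nothing but locals and the array:

import java.util.concurrent.CyclicBarrier;

// Sketch: static row partitioning. Thread "id" always handles the rows with
// i % numThreads == id, so there is no AtomicInteger and no getter call in the
// hot loop. One barrier per pivot step is still required, because step k + 1
// reads the rows updated in step k (await() also provides the necessary memory
// visibility between the threads).
class StaticDeco implements Runnable {

    private final double[][] A;
    private final int id;
    private final int numThreads;
    private final CyclicBarrier barrier;

    StaticDeco(double[][] A, int id, int numThreads, CyclicBarrier barrier) {
        this.A = A;
        this.id = id;
        this.numThreads = numThreads;
        this.barrier = barrier;
    }

    @Override
    public void run() {
        int n = A.length;
        try {
            for (int k = 0; k < n - 1; k++) {
                for (int i = k + 1; i < n; i++) {
                    if (i % numThreads != id) continue;   // not this thread's row
                    double pivot = A[i][k] / A[k][k];
                    for (int j = k + 1; j < n; j++) {
                        A[i][j] = A[i][j] - pivot * A[k][j];
                    }
                    A[i][k] = pivot;
                }
                barrier.await();   // everyone finishes step k before step k + 1
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

The main thread would then create one CyclicBarrier(numThr), start numThr of these runnables, and simply join them (or keep your Semaphore) before doing the substitution steps.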