Tags: c++, g++, mpi, sunos

Program without MPI instructions is extremely slow under mpirun


I'm writing a program to study MPI. As a first step, I wrote a function that multiplies square matrices.

long **multiplyMatrices(long **matrix1, long **matrix2, long capacity)
{
    long **resultMatrix = new long*[capacity];

    for (long i = 0; i < capacity; ++i) {
        resultMatrix[i] = new long[capacity];
    }

    for (long i = 0, j, k; i < capacity; ++i) {
        for (j = 0; j < capacity; ++j) {
            resultMatrix[i][j] = 0;

            for (k = 0; k < capacity; ++k) {
                resultMatrix[i][j] = resultMatrix[i][j] + matrix1[i][k] * matrix2[k][j];
            }
        }
    }

    return resultMatrix;
}

Where capacity == 1000.

On localhost (Mac Mini 2012, Core i7, OS X 10.8.2) I compile this code in Xcode with LLVM. The calculation takes 17 seconds in a single thread.

On the remote host (SunOS 5.11, dual-core CPU, 8 vCPU) I compile it with

g++ -I/usr/openmpi/ompi-1.5/include -I/usr/openmpi/ompi-1.5/include/openmpi -O2 main.cpp -R/opt/mx/lib -R/usr/openmpi/ompi-1.5/lib -L/usr/openmpi/ompi-1.5/lib -lmpi -lopen-rte -lopen-pal -lnsl -lrt -lm -ldl -lsocket -o main

or just

g++ -O2 main.cpp -o main

But mpirun main takes 152 seconds for the same calculation. What's wrong? Am I missing something? Is it related to the server's CPU architecture?


Solution

  • The main cause is memory management.

    Look at these lines:

    long **resultMatrix = new long*[capacity];
    
    for (long i = 0; i < capacity; ++i) {
        resultMatrix[i] = new long[capacity];
    }
    

    Each row is allocated separately, so the rows end up scattered across memory instead of forming one contiguous block. On the Mac Mini we know how physical memory is laid out (two memory modules), but on the server it could even be spread across different hosts (a cluster).

    Let's fix this by allocating the whole matrix as a single block.

    #include <cstdlib>  // malloc, free
    
    long **allocateMatrix(long capacity)
    {
        // Allocate the vector of row pointers
        long **matrix = (long **)malloc(capacity * sizeof(long *));
    
        // Allocate all matrix elements as one contiguous block
        matrix[0] = (long *)malloc(capacity * capacity * sizeof(long));
    
        // Point each row pointer at the start of its row inside the block
        long *lineAddress = matrix[0];
        for (long i = 0; i < capacity; ++i) {
            matrix[i] = lineAddress;
            lineAddress += capacity;
        }
    
        return matrix;
    }
    
    // capacity is unused here; it is kept so the call mirrors allocateMatrix
    void deallocateMatrix(long **matrix, long capacity)
    {
        free(matrix[0]);  // the element block
        free(matrix);     // the vector of row pointers
    }
    

    This cuts the run time to 9.8 seconds on the Mac Mini and to 58 seconds on the server.
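
To make the contiguous layout concrete, here is a small usage sketch. It repeats the allocator so that it compiles on its own, adds null checks, and introduces a hypothetical helper, rowsAreContiguous, which is not part of the original code. It only checks that each row starts exactly capacity elements after the previous one.

```cpp
#include <cassert>
#include <cstdlib>

// Same scheme as above: one block for the elements, one vector of row pointers.
long **allocateMatrix(long capacity)
{
    assert(capacity > 0);
    long **matrix = (long **)malloc(capacity * sizeof(long *));
    if (matrix == NULL) return NULL;

    matrix[0] = (long *)malloc(capacity * capacity * sizeof(long));
    if (matrix[0] == NULL) {
        free(matrix);
        return NULL;
    }

    for (long i = 1; i < capacity; ++i) {
        matrix[i] = matrix[i - 1] + capacity;
    }
    return matrix;
}

void deallocateMatrix(long **matrix, long /*capacity*/)
{
    free(matrix[0]);
    free(matrix);
}

// Hypothetical helper (an assumption for illustration): returns true when
// every row begins exactly `capacity` elements after the previous row.
bool rowsAreContiguous(long **matrix, long capacity)
{
    for (long i = 1; i < capacity; ++i) {
        if (matrix[i] != matrix[i - 1] + capacity) return false;
    }
    return true;
}
```

With this layout, matrix[i][j] and matrix[0][i * capacity + j] refer to the same element, so the multiplication code can keep its two-index syntax while the data sits in one block.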

    I still don't know where the rest of the time goes. Maybe I should change the order in which the loops traverse one of the matrices.
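
One common way to act on that last idea is loop interchange. In the original i, j, k order, the inner loop reads matrix2[k][j] with k changing, so it walks down a column and jumps a full row ahead in memory on every step. Swapping the two inner loops (i, k, j) makes the inner loop walk along rows of both the second matrix and the result. The sketch below is an assumption about what might help, not a measured fix for this server. The function name multiplyIKJ and the flat std::vector storage are illustrative choices, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiplies two n x n matrices stored row-major in
// flat vectors, so element (i, j) lives at index i * n + j.
std::vector<long> multiplyIKJ(const std::vector<long> &a,
                              const std::vector<long> &b,
                              std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<long> c(n * n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            // a(i, k) is constant for the whole inner loop
            const long aik = a[i * n + k];
            // The inner loop moves along row k of b and row i of c,
            // so both are accessed sequentially in memory.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

The result is the same as with the i, j, k order because only the order of the additions changes. Whether it actually runs faster on a given machine has to be measured.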