Search code examples
cparallel-processingopenmp

Segmentation fault(core dumped) using openmp


I'm currently working in a program that multiplies matrices in c, it receives the size of the matrices in the same line of execution,the program works for matrices below size 900 but when reaching matrices of more than size 900 I'm receiving the error segmentation fault(core dumped)

After checking many ressources available on internet I'm still unable to solve the problem, here is the code that I'm using:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

int main(int argc, char* argv[]) 
{
    int tam = atoi(argv[1]);
    int first[tam][tam];
    int second[tam][tam];
    int third[tam][tam];
    int i,j,k,l,f;
    srand(time(NULL));
    omp_set_num_threads(omp_get_num_procs());
    for (i= 0; i< tam; i++)
        for (j= 0; j< tam; j++)
    {
            l = rand();
            f = rand();
            first[i][j] = l;
            second[i][j] = f;
    }
    #pragma omp parallel for private(i,j,k) shared(first,second,third)
    for (i = 0; i < tam; ++i) {
        for (j = 0; j < tam; ++j) {
            for (k = 0; k < tam; ++k) {
                third[i][j] += first[i][k] * second[k][j];
            }
        }
    }

}

Solution

  • Root-cause of the [segfault] is not OpenMP, but huge STACK allocations

    You might have noticed, that the 900 was not a fixed threshold for going into [segfault] crashes.

    Yet, let's show one of the several ways forward: put the data on HEAP, instead of letting it get allocated on STACK:

    The TiO.RUN online run-verifiable demo is ready to click and run here ( Array sizes above ~ 8000 x 8000 x 8-B x 3-matrices start to get soft-killed by the platform, as runtimes grow beyond the online-platform demo quota of about one minute CPU-time ).

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    #include <sys/time.h>
    
    #define             cTAM 1234
    static int m3[cTAM][cTAM],              // HEAP-allocated data does not devastate STACK-resources
               m2[cTAM][cTAM],              // HEAP-allocated data does not devastate STACK-resources
               m1[cTAM][cTAM];              // HEAP-allocated data does not devastate STACK-resources
    
    int main(int argc, char* argv[]) 
    {
        int             tam = cTAM;         // proxy for not accessible command-line parameter passing
    //  int first[ tam][tam];
    //  int second[tam][tam];
    //  int third[ tam][tam];
    
        int i,j,k,l,f;
    
        srand(time(NULL));
    // ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
    //  omp_set_num_threads(omp_get_num_procs());
        for     ( i = 0; i < tam; i++ )
            for ( j = 0; j < tam; j++ )
        {
                l = rand();
                f = rand();
            //  first[i][j] = l;
            //  second[i][j] = f;
                m1[i][j] = l;
                m2[i][j] = f;
        }
    // ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
    //  #pragma omp parallel for private(i,j,k) shared(first,second,third)
        for         ( i = 0; i < tam; ++i ) {
            for     ( j = 0; j < tam; ++j ) {
                for ( k = 0; k < tam; ++k ) {
                //   third[i][j] += first[i][k] * second[k][j];
                     m3[i][j]    += ( m1[i][k]
                                    + m2[   k][j]
                                      );
                }
            }
        }
        return( 1 );
    }
    

    The further OpenMP tricks are another subject, not related to [segfaults]

    • refactor the indexing, so that better cache-line aligned processing re-uses row-wise data-elements from already fetched cache-line "neighbours" and best if shared access to third[][]-cells is completely avoided ( performance boosts will reward these efforts )
    • performance will get boosted, if first[][] ( or m1[][] above ) will become transposed, as the iteration will follow the row-wise memory-blocks already pre-fetched into cache-line ( data-access costs will drop from about ~1E2 [ns] down to ~0.5 [ns] on cache-hits, i.e. ~ 200 X )

    It would be worth reading more and learning the technical details about cache-hierarchies, coalescing memory-access-patterns and about smart cache-line tricks to increase the cache-line re-use, instead of re-fetching the data that was once already fetched by a poor order of index iterations. You had the same problem with the GPU-kernel code, so indeed worth to spend a few days on a deep dive into these technology tricks.