Tags: c, openmp

OMP parallel nested for cycles


I'm trying to multiply two n by n matrices using OpenMP in C, but my code only calculates the diagonal elements. Here is the code:

My code

# include <stdio.h>
# include <omp.h>

int main(void){

  int N, element, tid;
  int i, j;
  scanf("%d", &N); // matrix dimension
  
  int A[N][N]; // matrix A
  int B[N][N]; // matrix B
  int C[N][N]; // matrix C := AB
  
  // declare elements matrix A
  for(i = 0; i < N; i++){
    for(j = 0; j < N; j++){
      scanf("%d", &element);
      A[i][j] = element;
    }
  }

  // declare elements matrix B
  for(i = 0; i < N; i++){
    for(j = 0; j < N; j++){
      scanf("%d", &element);
      B[i][j] = element;
    }
  }
  
  omp_set_num_threads(N); // set the number of threads

  #pragma omp parallel default(none) shared(A, B, C, N) private(i, j, tid) 
  { 
    tid = omp_get_thread_num(); // returns the number of the currently executing thread within the team
    #pragma omp for 
    for (i = 0; i < N; i++){
      C[tid][i] = 0;
      for (j = 0; j < N; j++){
        //tid = i*n + j;
        C[tid][i] += A[tid][j]*B[j][i];
      }
    }
  }

  // print matrix C 
  for(i = 0; i < N; i++){
    for(j = 0; j < N; j++){
      printf("%d ", C[i][j]);
    }
    printf("\n");
  }

  return(0);
}

This is what I get with N = 2 and matrices A = {{1, 2}, {3, 4}} and B = {{1, 2}, {3, 4}}:

C = {{7, random number}, {random number, 22}}


Solution

  • Let's look at how your code behaves:

    #pragma omp parallel default(none) shared(A, B, C, N) private(i, j, tid) 
    

This creates N threads, all of which execute the enclosed block.

    #pragma omp for 
        for (i = 0; i < N; i++){
    

This distributes the N iterations among the N threads. With the default OpenMP scheduling, the first thread (tid=0) executes only the first iteration (i=0), the second thread (tid=1) only the second iteration (i=1), and so on. This means that in the loop body you always have i == tid, and consequently C[tid][i] addresses only the diagonal elements.

You could obtain the desired result by removing #pragma omp for, so that the loop on i is fully executed by each thread (each thread then writes only its own row tid). But as mentioned in the comments, this approach of creating N threads will give poor performance as soon as N is larger than the number of cores (and won't use all the cores when N is smaller than the number of cores).

    EDIT

The right way is instead to keep the default number of threads (which is usually equal to the number of physical cores), and to define a parallel loop that distributes the rows to the threads:

      #pragma omp parallel default(none) shared(A, B, C, N) 
      { 
        #pragma omp for
        for (int k = 0; k < N; k++) {
          for (int i = 0; i < N; i++){
            C[k][i] = 0;
            for (int j = 0; j < N; j++){
              C[k][i] += A[k][j]*B[j][i];
            }
          }
        }
      }
    

    If N=15 for instance, and with 4 threads:

    • the 1st thread will process the iterations for k=0,1,2,3
    • the 2nd thread will process the iterations for k=4,5,6,7
    • the 3rd thread will process the iterations for k=8,9,10,11
    • the 4th thread will process the iterations for k=12,13,14

But note that this code is likely not efficient, even when executed serially without OpenMP. The memory access pattern on B in the inner loop is poor: it reads a column of B, i.e. non-contiguous elements in memory. For small N this doesn't matter, but for large N it can cause many cache misses and degrade performance.