Tags: c, multithreading, parallel-processing, pthreads, openmp

Parallelizing Simpson's Method in C using pthreads and OpenMP


I am fairly new to the whole multiprocessing and parallelizing world. I am currently working on an algorithm to implement Simpson's method and Simpson's 3/8 rule. I have already implemented both algorithms in C in their serial form.

I need some help to parallelize both of these algorithms using OpenMP and Pthreads. Any recommendation is welcome and appreciated.

//Simpson.c
#include <stdio.h>
#include <stdlib.h>

double function(double x) //function to which the method will be applied
{
  return 0.5*x;    
}

int main(int argc, char*argv[])
{
  double resultado = 0.0; //result of the integral after applying Simpson's method
  double a = 0.0;  //lower limit of the integral
  double b = 0.0;  //upper limit of the integral
  int n = 0; //number of partitions between the intervals, must be an even number
  int i = 0; //variable used to iterate over the different partitions
  double h = 0.0; //variable to store the increment between each value of "x"
  double calc = 0.0; //intermediate variable holding the contribution of each subinterval
  
  //variables to store the different values of "x", i.e. the lower and upper limits and the midpoint of each partition  
  double x0 = 0.0;
  double x1 = 0.0;
  double x2 = 0.0;  
    
  printf("Enter the lower limit: ");
  scanf("%lf", &a);
  printf("Enter the upper limit: ");
  scanf("%lf", &b);
  printf("Enter the number of partitions (even number): ");
  scanf("%d", &n);  
    
  h = (b-a)/n; //compute the increment used to iterate over all the partitions
    
      //The integral is computed using xj, x(j+1) and x(j+2) on the subinterval [xj, x(j+2)] for each i = 0, 2, 4, ..., n-2
    
  for (i=0; i<n; i+=2)
  {   
    x0 = a + h*i; //lower limit of the subinterval
    calc = function(x0); //evaluate the function at x0 (note: calc is reset here each iteration)
    
    x1 = a + h*(i+1); //midpoint
    calc += 4*function(x1);
      
    x2 = a + h*(i+2); //upper limit of the subinterval
    calc += function(x2);
      
    calc *= ((x2-x0)/6); //width of the subinterval divided by 6 (equals h/3)
      
    resultado += calc; //accumulate the contribution of this subinterval
  }
    
  printf("The approximation of the integral is: %f \n", resultado);

  return 0;
}
//Simpson38.c
#include <stdio.h>
#include <math.h>

// function to integrate
#define f(x) (0.5*x)

int main()
{
  float b; //upper limit
  float a; //lower limit
  float resultado=0.0; 
  float dx;
  float k;
  int i; 
  int N;  //number of intervals or iterations

  
 // Inputs
 printf("Enter the lower limit: ");
 scanf("%f", &a);
 printf("Enter the upper limit: ");
 scanf("%f", &b);
 printf("Number of iterations/intervals (multiple of 3): ");
 scanf("%d", &N);

  //Compute delta x
  dx = (b - a)/N;
  
  //Integration sums
  resultado = f(a) + f(b); //first sum (the two endpoints)
  
  for(i=1; i<= N-1; i++)
  {
    k = a + i*dx; //current x value in the interval
    
    if(i%3 == 0)
    {
      resultado = resultado + 2 * f(k);
    }
    else
    {
      resultado = resultado + 3 * f(k);
    }
  }
  
  resultado = resultado * dx*3/8;
  
  printf("\nThe value of the integral is: %f \n", resultado);
  printf("\n");
  
  return 0;
}

Solution

  • My first recommendation is to start by reading a lot about the shared-memory paradigm and its problems. Then read about Pthreads and OpenMP, and only then look at concrete OpenMP and Pthread parallelizations and how they deal with some of the issues of the shared-memory paradigm. If you have to learn both, I would start with Pthreads and only then look into OpenMP: the latter takes care of some of the cumbersome details, and pitfalls, of coding with the former.

    I will try to give you a very general (not super detailed) recipe for how to approach parallelizing a code, using your second program and OpenMP as the example.

    If you opt to parallelize an algorithm, first you need to identify the part of the code that is worth parallelizing (i.e., where the overhead of the parallelization is overshadowed by the speedup it brings). Profiling is fundamental for this; however, for small codes like yours it is straightforward to figure out: it will typically be the loops, for instance:

    for(i=1; i<= N-1; i++)
      {
        k = a + i*dx; //Suma del intervalo
        
        if(i%3 == 0)
        {
          resultado = resultado + 2 * f(k);
        }
        else
        {
          resultado = resultado + 3 * f(k);
        }
      }
    

    To parallelize this code, which construct should you use: tasks? parallel for? You can look at the great answers on this SO thread. In your code, it is pretty clear that we should use parallel for (i.e., #pragma omp parallel for). Consequently, we are telling OpenMP to divide the iterations of that loop among the threads.

    Now you should think about whether the variables used inside the parallel region should be private to each thread or shared among them. For that, OpenMP offers clauses such as private, firstprivate, lastprivate, and shared. Another great SO thread on the subject can be found here. To evaluate this you really need to understand the shared-memory paradigm and how OpenMP handles it.

    Then look at potential race conditions, interdependencies between variables, and so on. OpenMP offers constructs such as critical, atomic operations, and reductions to solve those issues (other great SO threads: atomic vs critical and reduction vs atomic). TL;DR: you should opt for the solution that gives you the best performance without compromising correctness.

    In your case, a reduction on the variable "resultado" is clearly the best option:

     #pragma omp parallel for private(k) reduction(+ : resultado) 
     for(i=1; i<= N-1; i++)
      {
        k = a + i*dx; //Suma del intervalo
        
        ... 
      }
    

    After you guarantee the correctness of your code, you can look at load-imbalance problems (i.e., differences in the amount of work performed by each thread). To deal with this issue, OpenMP offers parallel for scheduling strategies such as dynamic and guided (another SO thread), and the possibility of tuning the chunk size of those distributions.

    In your code, every iteration performs the same amount of work, so there is no need for a dynamic or guided schedule; you can opt for the static one instead. The benefit of static over the others is that it does not introduce the overhead of coordinating the distribution of iterations among threads at runtime.

    There is much more to look into to extract the maximum performance out of the architecture that your code runs on, but for now, I think this is a good enough start.