pthread running thread concurrently c

I need to make the Leibniz algorithm with pthread in c, now I have this code, but at the moment the threads implementation takes the same time of the sequential implementation, I think it is not running concurrently. Can someone see the error.

Thanks!!

#include<stdio.h>
#include<math.h>
#include<pthread.h>
#include<stdlib.h>
#define NUM_THREADS 2
#define ITERATIONS 100000000
double result = 0.0;
void *leibniz(void *threadid){
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    int denom = 2*i+1;
    result += pow(-1.0, i) * (1.0/denom);
  }
}

int main(){
  pthread_t threads[NUM_THREADS];
  long t;
  int rc;

  // CREATE
  for(t=0;t<NUM_THREADS;t++){
    rc = pthread_create(&threads[t], NULL, leibniz, (void *)t);
    if(rc){
      printf("ERROR: return code %d\n", rc);
    }
  }
  // JOIN
  for(t=0;t<NUM_THREADS;t++){
    rc = pthread_join(threads[t], NULL);
    if(rc){
      printf("ERROR: return code %d\n", rc);
      exit(-1);
    }
  }
  printf("Pi %f\n", result*4);
  exit(0);

}

Thanks to Jean-François Fabre I did these changes, now it works!

double result=0.0;

void *leibniz(void *threadid){
  double local = 0.0;
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    local += (i%2==0 ? 1 : -1) * (1.0/(2*i+1));
  }
  result += local*4;
}

Solution

I'll attempt to answer.

Even if your application is multithreaded, it's not guaranteed that there's 1 FPU per core. I know little about that but I think that some AMD processors actually share the FPU between cores.

Since your loop is basically adding and pow, it's 99% FPU computation, so if the FPU is shared on your computer, it explains the bottleneck.

You could reduce FPU usage by not calling pow just to compute -1 or 1, which would be a scalar operation, and maybe would make a difference then. just use -1 if i is odd, 1 otherwise, or negate an external 1/-1 variable at each iteration.

Also, in order to avoid race conditions, accumulate the result in a local result, and add it in the end (protecting the addition by a mutex in the end would be even better)

double result = 0.0;
void *leibniz(void *threadid){
  double local = 0.0;
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    int denom = 2*i+1;
    // using a ternary/scalar speeds up the "pow" computation, multithread or not
    local += (i%2 ? -1 : 1) * (1.0/denom);
  }
  // you may want to protect that addition with a pthread_mutex
  // start of critical section
  result += local;
  // end of critical section
}

http://wccftech.com/amd-one-fpu-per-core-design-zen-processors/