I have a matrix, call it small_matrix, made up of about 100,000 rows and 128 columns, stored as a single flat array (I'm using it for CUDA calculations, so the saved space is needed). I have a larger matrix, call it large_matrix, with 10x the number of rows and the same row length as small_matrix, and I want to populate its rows with the rows from small_matrix. However, the population process isn't 1:1. There is a map array that maps every row in large_matrix to a row in small_matrix; a single row in small_matrix can be mapped to by multiple rows in large_matrix. We can assume that the map array is randomly generated. There is also a small chance (assume 1%) that a row in large_matrix is not mapped at all and gets filled with a constant value (0.5 in the code below) instead of a row from small_matrix.
I'm trying to parallelize this process with OpenMP in C++, but everything I've tried so far has only increased the runtime as I add threads instead of decreasing it. Here is the problem's code; the function I'm trying to optimize is expand_matrix:
#include <stdio.h>
#include <omp.h>
#include <random>
#include <stdlib.h>
#include <cstddef>
#include <ctime>
#include <cstring>

using namespace std;

inline void* aligned_malloc(size_t size, size_t align) {
    void *result;
#ifdef _MSC_VER
    result = _aligned_malloc(size, align);
#else
    if (posix_memalign(&result, align, size)) result = 0;
#endif
    return result;
}

inline void aligned_free(void *ptr) {
#ifdef _MSC_VER
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
void expand_matrix(int num_rows_in_large_matrix, int row_length, long long *map, float *small_matrix, float *large_matrix, const int num_threads);
int main() {
    int row_length = 128;
    long long small_matrix_rows = 100000;
    long long large_matrix_rows = 1000000;
    long long *map = new long long[large_matrix_rows];
    float *small_matrix = (float*)aligned_malloc(small_matrix_rows*row_length*sizeof(float), 128);
    float *large_matrix = (float*)aligned_malloc(large_matrix_rows*row_length*sizeof(float), 128);
    minstd_rand gen(std::random_device{}()); // NOTE: Valgrind will give an error saying: vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1 :: look: https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/
    uniform_real_distribution<double> values_dist(0, 1);
    // uniform_int_distribution bounds are inclusive, so subtract 1 to stay in range
    uniform_int_distribution<long long> map_dist(0, small_matrix_rows - 1);
    for (long long i = 0; i < small_matrix_rows*row_length; i++) {
        small_matrix[i] = values_dist(gen) - 0.5;
    }
    for (long long i = 0; i < large_matrix_rows; i++) {
        if (values_dist(gen) < 0.99)
            map[i] = map_dist(gen);
        else
            map[i] = -1; // mark the ~1% of rows that get the constant fill
    }

    clock_t start, end;
    int num_threads = 4;
    printf("Populated matrix and generated map\n");
    start = clock();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = clock();
    printf("Time to expand using %d threads = %f\n", num_threads, double(end - start)/CLOCKS_PER_SEC);
    return 0;
}
void expand_matrix(int num_rows_in_large_matrix, int row_length, long long *map, float *small_matrix, float *large_matrix, const int num_threads) {
    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp for schedule(guided, 4)
        for (long long i = 0; i < num_rows_in_large_matrix; i++) {
            long long sml = map[i];
            if (sml == -1) {
                for (int j = 0; j < row_length; j++)
                    large_matrix[i*row_length + j] = 0.5;
            }
            else {
                memcpy(large_matrix + i*row_length, small_matrix + sml*row_length, row_length*sizeof(float));
            }
        }
    }
}
Here are some runtimes:
Time to expand using 1 threads = 0.402949
Time to expand using 2 threads = 0.530361
Time to expand using 4 threads = 0.608085
Time to expand using 8 threads = 0.667806
Time to expand using 16 threads = 0.999886
I've made sure the matrices are memory-aligned, and I've tried using non-temporal instructions for the copy. I am stumped and don't know where to look anymore. Any help is much appreciated.
Some hardware information:
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
Using Ubuntu 16.04 and gcc version 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04).
Thanks to @Gilles and @Zulan for pointing out the error. I'm posting it as an answer so others can see the issue. I was using the wrong time measurement method, one that doesn't work for multithreaded applications. In other words, I misused the clock() function. Here is @Gilles' comment:
clock() measures CPU time, which increases with the number of CPUs you add. omp_get_wtime() measures the wall-clock time, which is the one you want to see decreasing.
The function I was using to measure execution time, clock(), counts the processor ticks taken by all the processors involved in running the code. When the code runs in parallel on multiple processors, the ticks returned by clock() are summed across all of them, so the number only grows as I add processors. When I switched the measurement to omp_get_wtime(), the timing was correct and I got the following results:
1 thread = 0.423516
4 threads = 0.152680
8 threads = 0.090841
16 threads = 0.064748
So, instead of measuring the runtime like this:
clock_t start, end;
start = clock();
expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
end = clock();
printf("Total time %f\n", double(end-start)/CLOCKS_PER_SEC);
I do it like this:
double start, end;
start = omp_get_wtime();
expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
end = omp_get_wtime();
printf("Total time %f\n", end-start);