Why doesn't this OpenMP taskloop parallelize?

I have the following function:

#include <stddef.h>
#include <stdint.h>
#include <omp.h>

const int TS_GLOBAL = 20;

void apply_grayscale(uint8_t* input_buffer,uint8_t* output_buffer,uint16_t image_w,uint16_t image_h)
{
    // each row is 3 bytes per pixel, padded to a multiple of 4 bytes
    size_t pixel_row_size = 3 * image_w;
    if(pixel_row_size % 4)
        pixel_row_size += 4 - (pixel_row_size % 4);

    const int TS = TS_GLOBAL;
    #pragma omp taskloop grainsize(TS) // <------- here
    for(uint16_t i = 0; i < image_h; i++)
    {
        uint8_t* current_input_buffer = input_buffer + (i * pixel_row_size);
        uint8_t* current_output_buffer = output_buffer + (i * pixel_row_size);

        grayscale_row(current_input_buffer,current_output_buffer,image_w);
    }
}

It doesn't seem to scale well. When I instead use

#pragma omp parallel for

it gets significantly faster. Nevertheless, I need it to work with OpenMP tasks too. Here is the #pragma omp parallel for version, which works just fine:

void apply_grayscale(uint8_t* input_buffer,uint8_t* output_buffer,uint16_t image_w,uint16_t image_h)
{
    size_t pixel_row_size = 3 * image_w;
    if(pixel_row_size % 4)
        pixel_row_size += 4 - (pixel_row_size % 4);
    
        
    #pragma omp parallel for
    for(uint16_t i = 0; i < image_h; i++)
    {
        uint8_t* current_input_buffer = input_buffer + (i * pixel_row_size);
        uint8_t* current_output_buffer = output_buffer + (i * pixel_row_size);
        
        grayscale_row(current_input_buffer,current_output_buffer,image_w);
    }
}

However, I want to use the task paradigm, to gain a better understanding of the technology.

I found the taskloop construct on OpenMP's official tutorial site. As far as I understand, it should do virtually the same thing as parallel for, just with tasks.

Any idea why I don't get any speedup?

Solution

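Thanks to @Homer512, it's now working. The tasks have to be created inside a parallel region: task and taskloop only generate tasks and do not start any threads themselves, so without an enclosing parallel region everything runs on a single thread. If anyone runs into this issue again, here is how I solved it: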

const int per_task = 2; // rows handled by each task

void apply_grayscale(uint8_t* input_buffer,uint8_t* output_buffer,uint16_t image_w,uint16_t image_h){
  size_t pixel_row_size = 3 * image_w;
  if(pixel_row_size % 4)
      pixel_row_size += 4 - (pixel_row_size % 4);

  // parallel creates the thread team; single makes one thread generate the tasks
  #pragma omp parallel
  #pragma omp single
  {
      for(uint16_t i = 0; i < image_h; i+=per_task) {
          // each task processes up to per_task consecutive rows
          #pragma omp task
          {
              for (int j = 0; j < per_task && i + j < image_h; j++) {
                  uint8_t* current_input_buffer = input_buffer + ((i + j) * pixel_row_size);
                  uint8_t* current_output_buffer = output_buffer + ((i + j) * pixel_row_size);
                  grayscale_row(current_input_buffer,current_output_buffer,image_w);
              }
          }
      }
  }
}
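
For completeness, the original taskloop version should also parallelize once it sits inside a parallel region with a single thread generating the tasks. Below is a minimal sketch under a few assumptions: the name apply_grayscale_taskloop is made up for illustration, grayscale_row is a hypothetical stand-in (a plain average of the three channels) because the real one is not shown in the question, and the grainsize of 20 is simply the value of TS_GLOBAL from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real grayscale_row, which is not shown in the
   question: writes the average of the three channels to each output channel. */
static void grayscale_row(const uint8_t* in, uint8_t* out, uint16_t image_w)
{
    for (size_t x = 0; x < image_w; x++) {
        uint8_t gray = (uint8_t)((in[3 * x] + in[3 * x + 1] + in[3 * x + 2]) / 3);
        out[3 * x] = gray;
        out[3 * x + 1] = gray;
        out[3 * x + 2] = gray;
    }
}

/* Hypothetical name; same row layout as apply_grayscale in the question. */
void apply_grayscale_taskloop(uint8_t* input_buffer, uint8_t* output_buffer,
                              uint16_t image_w, uint16_t image_h)
{
    /* 3 bytes per pixel, each row padded to a multiple of 4 bytes */
    size_t pixel_row_size = 3 * (size_t)image_w;
    if (pixel_row_size % 4)
        pixel_row_size += 4 - (pixel_row_size % 4);

    /* parallel creates the thread team; single lets one thread run the
       taskloop, which splits the rows into tasks of about 20 iterations
       that the other threads of the team can pick up. */
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop grainsize(20)
        for (int i = 0; i < image_h; i++) {
            grayscale_row(input_buffer + i * pixel_row_size,
                          output_buffer + i * pixel_row_size,
                          image_w);
        }
    }
}
```

This needs OpenMP enabled at compile time (for example -fopenmp with GCC or Clang); without it the pragmas are ignored and the loop simply runs serially. Compared with the manual task loop above, taskloop does the chunking itself, so per_task is replaced by the grainsize clause.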