Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelize the inner loop of a program that has a data dependency (min) outside the scope of the loops. The problem is with the residual calculations that occur outside the scope of the inner j loop. The code produces errors if the "#pragma omp parallel" part is included on the j loop, even when the loop doesn't run at all because the k value is too low (1, 2, or 3, for example).

for (i = 0; i < 10; i++)
  {
    #pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
    for (j = 0; j < k-4; j += 4)
    {
      mm_a = _mm_load_ps(&x[j]);
      mm_b = _mm_load_ps(&y[j]);
      mm_a = _mm_add_ps(mm_a, mm_b);
      _mm_store_ps(storer, mm_a);

      #pragma omp critical
      {
      if (storer[0] < min)
      {
        min = storer[0];
      }
      if (storer[1] < min)
      {
        min = storer[1];
      }
      //etc
      }
    }
    do
    {
        #pragma omp critical
        {
        if (x[j]+y[j] < min)
        {
          min = x[j]+y[j];
        }    
        } 
    } while (j++ < (k - 1));
    round_min = min;
  }

Solution

The j loop is a parallel loop, so you cannot rely on the value of j after it. This is especially true because you explicitly declared j as private: each thread works on its own copy, and the value is not visible outside the parallel region. You can instead compute the starting position of the remaining elements explicitly, as (k-4+3)/4*4, just after the parallel loop.
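
As a minimal sketch of that index computation (the helper names tail_start and min_tail are hypothetical, and k >= 0 is assumed), the clean-up pass can use its own index instead of reusing j:

```c
/* Hypothetical helper: the value j would hold after the serial loop
   `for (j = 0; j < k - 4; j += 4)`, i.e. the first index that loop
   does not cover. Assumes k >= 0. */
static int tail_start(int k)
{
    return (k - 4 + 3) / 4 * 4;
}

/* Scalar clean-up pass over the leftover elements. It uses its own
   index t, so it does not depend on the private j of the parallel loop.
   Unlike a do/while, it does nothing when there are no elements left. */
static float min_tail(const float *x, const float *y, int k, float min)
{
    for (int t = tail_start(k); t < k; ++t) {
        const float sum = x[t] + y[t];
        if (sum < min)
            min = sum;
    }
    return min;
}
```

Calling min = min_tail(x, y, k, min); right after the parallel loop would then replace the do/while block.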

Furthermore, here are a few important points:

You may not need to vectorize the code yourself: you can use omp simd with a reduction clause. OpenMP can then handle the tedious residual calculations for you automatically. Moreover, the code will be portable and much simpler, and the generated code may well be faster than yours. Note, however, that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case, this will wipe out any possible improvement from the parallel section. The code will likely be slower due to cache-line bouncing.
Reading back data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data directly).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and easy to apply here.
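
If you do want to keep hand-written intrinsics, a vertical reduction could look roughly like the following. This is only a serial sketch: min_sum_sse is a hypothetical name, unaligned loads are used because nothing here guarantees the alignment of x and y, and FLT_MAX is assumed to be an acceptable starting value.

```c
#include <float.h>      /* FLT_MAX */
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_min_ps, ... */

/* Hypothetical helper: minimum of x[j] + y[j] for j in [0, k).
   Returns FLT_MAX when k <= 0. */
static float min_sum_sse(const float *x, const float *y, int k)
{
    __m128 vmin = _mm_set1_ps(FLT_MAX);  /* four independent running minima */
    int j = 0;

    /* Vertical reduction: one _mm_min_ps per four elements,
       with no stores and no per-lane branches inside the loop. */
    for (; j + 4 <= k; j += 4) {
        const __m128 a = _mm_loadu_ps(&x[j]);
        const __m128 b = _mm_loadu_ps(&y[j]);
        vmin = _mm_min_ps(vmin, _mm_add_ps(a, b));
    }

    /* A single horizontal reduction, done once after the loop. */
    float lanes[4];
    _mm_storeu_ps(lanes, vmin);
    float min = lanes[0];
    for (int l = 1; l < 4; ++l)
        if (lanes[l] < min)
            min = lanes[l];

    /* Scalar tail for the remaining k % 4 elements. */
    for (; j < k; ++j) {
        const float sum = x[j] + y[j];
        if (sum < min)
            min = sum;
    }
    return min;
}
```

The lane-by-lane comparison still exists, but it runs once per call rather than once per iteration, and no critical section is needed.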

Here is corrected code that takes the above points into account:

for (i = 0; i < 10; i++)
{
    // Assume min is already initialized correctly here

    #pragma omp parallel for simd reduction(min:min) private(j)
    for (j = 0; j < k; ++j)
    {
        const float tmp = x[j] + y[j];
        if(tmp < min)
            min = tmp;
    }

    // Use min here
}

The above code is vectorized correctly on x86 with GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise /openmp:experimental).
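
For completeness, here is one way the loop could be wrapped into a self-contained function. The name min_of_sums and the FLT_MAX starting value are assumptions made for illustration. Declaring j inside the for statement makes it private without an explicit clause.

```c
#include <float.h>  /* FLT_MAX */

/* Hypothetical wrapper around the reduction loop.
   Build with OpenMP enabled (for example -O3 -fopenmp on GCC).
   Without it the pragma is ignored and the loop runs serially,
   producing the same result. */
static float min_of_sums(const float *x, const float *y, int k)
{
    float min = FLT_MAX;  /* assumed starting value; returned as-is when k <= 0 */

    #pragma omp parallel for simd reduction(min:min)
    for (int j = 0; j < k; ++j) {
        const float tmp = x[j] + y[j];
        if (tmp < min)
            min = tmp;
    }
    return min;
}
```

Inside the outer i loop this would be used as round_min = min_of_sums(x, y, k);, with no separate residual pass.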