Can a thread-local copy of select elements be created of a shared 2D array in a parallel region? (Shared, private, barrier: OPenMP)

I have a 2-D grid of nxn elements. In one iteration, I'm calculating the value of one element by averaging the values of its neighbors. That is:

    for(int i=0;i<n;i++)
        for(int j=0;j<n;j++)
            grid[i][j] = (grid[i-1][j] + grid[i][j-1] + grid[i+1][j] + grid[i][j+1])/4.0;

And I need to run the above nested loop for iter number of iterations. What I need is the following:

I need the threads to calculate this average, wait till all the threads have finished calculating and THEN update the grid in one go.
The loop with iter iterations will run sequentially, but during every iteration, the value of grid[i][j] for every i and j should be calculated in parallel.

In order to do that I have the following ideas and questions:

Maybe make grid shared and put a copy of the select 4 elements of the grid that is needed for calculating grid[i][j] by making only those 4 elements private to the thread. (Basically grid is shared by all threads, but there is a local copy of 4 iteration-specific elements in every thread too.) Is this possible?
Would a barrier be in fact needed for all the threads to finish and then start onto the next iteration?

I'm very new to the OpenMP way of thinking and I'm utterly lost in this simple problem. I'd be grateful if somebody could help resolve my confusion.

Solution

In practice, you'd want to have (much) fewer threads than grid points, so each thread will be calculating a whole bunch of points (for example, one row). There is a certain overhead associated with starting OpenMP (or any other kind of) threads, and you program will be memory-bound rather than CPU-bound anyway. So starting a thread per grid point will defeat the whole purpose of parallelizing the computation. Hence, your idea #1 is not recommended (I am not quite sure I understood it correctly though; maybe this is not what you were proposing).
I would recommend (also pointed out by others in OP comments) you allocate twice the memory needed to store the grid values and use two pointers that are swapped between iterations: one points to memory holding previous iteration values that are read only, the other one to new iteration values that are write-only. Note that you will only swap the pointers, not actually copy the memory. After your iteration is done, you can copy the final result into desired location.

Yes, you need to synchronize threads between iterations, however in OpenMP this is usually done implicitly simply by opening a parallel region within the iteration loop (there is an implicit barrier at the end of a parallel region):

for (int iter = 0; iter < niter; ++iter)
{
    #pragma omp parallel
    {
        // get range of points for current thread
        // loop over thread's points and apply the stencil
    }
}

or, using a parallel for construct:

const int np = n*n;
for (int iter = 0; iter < niter; ++iter)
{
    #pragma omp parallel for
    for (int ip = 0; ip < np; ++ip)
    {
        const int i = ip / n;
        const int j = ip % n;
        // apply the stencil to [i,j]
    }
}

The second version will auto-distribute the work evenly between the available threads, which is most likely what you want. In the first you have to do it manually.