Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:

B[i][j] = exp(A[i][j])

where i in [0, Ny] and j in [0, Nx].

Notice that this is different from matrix exponential:

B = exp(A)

which can be accomplished with some unstable/unsupported code in GSL (linalg.h).

I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?

EDIT

Results from the solution post of Drew Hall

All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.

Results when taking into account the {Row,Column}-Major mode to store the matrix:
- 226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
- 223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
- 224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).

Source code for case 1:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 2:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 3:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}

Solution

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.

In particular, your goal should be to mimimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).

Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (MN bound). You'll need to get a raw pointer to the underlying memory block (that is, a double rather than a double**) to do this.

Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).