java algorithm optimization parallel-processing scientific-computing

How to optimize the computing speed of this for loop?

I am kind of stuck with a difficult problem (for me, at least). I noticed when profiling my code that nearly all my (single core) computing time was eaten up by the single nested loop below (double integral on an image). What do you think would be the best way to accelerate its computing?

I tried to map it to nested streams, but I do not understand how to map the multiple if blocks... Would trying to do it on the GPU using OpenCL be better suited to the problem?

ip is an ImageJ ImageProcessor, and its method .getPixelValue(x,y) is also quite ressource consuming. But since it belongs to an established lib, I would like to avoid modifying it if I can.

Variables declarations:

private ImageProcessor ip = null; //This type comes from ImageJ
private double area;
private double a11, a22;
private double u1, u2;
private double v1, v2;
private double y1, y2;
private static final double HALF_SQRT2 = sqrt(2.0) / 2.0;
private static final double SQRT_TINY =
    sqrt((double)Float.intBitsToFloat((int)0x33FFFFFF));

Function:

private double contrast (
) {
    if (area < 1.0) {
        return(1.0 / SQRT_TINY);
    }
    double c = 0.0;
    final int xmin = max((int)floor(u1), 0);
    final int xmax = min((int)ceil(v1), width - 1);
    final int ymin = max((int)floor(u2), 0);
    final int ymax = min((int)ceil(v2), height - 1);
    if ((u1 < xmin) || (xmax < v1) || (u2 < ymin) || (ymax < v2)){
        return(1.0 / SQRT_TINY);
    }
    if ((xmax <= xmin) || (ymax <= ymin)) {
        return(1.0 / SQRT_TINY);
    }
    for (int y = ymin; (y <= ymax); y++) {
        final double dy = y2 - (double)y;
        final double dy2 = dy * dy;
        for (int x = xmin; (x <= xmax); x++) {
            final double dx = y1 - (double)x;
            final double dx2 = dx * dx;
            final double d = sqrt(dx2 + dy2);
            double z = a11 * dx2 + a12 * dx * dy + a22 * dy2;
            if (z < SQRT_TINY) {
                c -= ip.getPixelValue(x, y);
                continue;
            }
            z = a3 / sqrt(z);
            double d0 = (1.0 - z / SQRT2) * d;
            if (d0 < -HALF_SQRT2) {
                c -= ip.getPixelValue(x, y);
                continue;
            }
            if (d0 < HALF_SQRT2) {
                c += SQRT2 * d0 * ip.getPixelValue(x, y);
                continue;
            }
            d0 = (1.0 - z) * d;
            if (d0 < -1.0) {
                c += ip.getPixelValue(x, y);
                continue;
            }
            if (d0 < 1.0) {
                c += (1.0 - d0) * ip.getPixelValue(x, y) / 2.0;
                continue;
            }
        }
    }
    return(c / area);

Solution

You could try a divide and conquer approach.

Divide the image into n parts, that are processed in parallel. But you have to handle the edge cases that occur on the borders (where two parts meet).

Or you could start looking for numeric algorithms, which compute (discrete) integrals and were designed for parallelism.

Update

Since your method is called contrast, I assume you're changing the contrast of the image.

Operations on images can be performed via convolution (which is essentially a discrete double integral, if performed on a 2D image) with a specific kernel (image filter). These operations can be computed on the GPU and yield speed-ups by orders of magnitude. You can use OpenCl to write programs that execute on several GPUs.