How to accumulate vectors in OpenCL?

I have a set of operations running in a loop.

for(int i = 0; i < row; i++)
{
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]

    arr1 += offset1;
    arr2 += offset2;
}

Now I'm trying to vectorize the operations like this

for(int i = 0; i < row; i++)
{
    convert_int4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

But how do I accumulate the resulting vector in the scalar sum without using a loop?

I'm using OpenCL 2.0.

Solution

I have found a solution which seems to be the closest way I could have expected to solve my problem.

uint sum = 0;
uint4 S;

for(int i = 0; i < row; i++)
{
    S += convert_uint4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

S.s01 = S.s01 + S.s23;
sum = S.s0 + S.s1;

OpenCL 2.0 provides this functionality with vectors where the elements of the vectors can successively be replaced with the addition operation as shown above. This can support up to a vector of size 16. Larger operations can be split into factors of smaller operations. For example, for adding the absolute values of differences between two vectors of size 32, we can do the following:

uint sum = 0;
uint16 S0, S1;

for(int i = 0; i < row; i++)
{
    S0 += convert_uint16(abs(vload16(0, arr1) - vload16(0, arr2)));
    S1 += convert_uint16(abs(vload16(1, arr1) - vload16(1, arr2)));

    arr1 += offset1;
    arr2 += offset2;
}

S0 = S0 + S1;
S0.s01234567 = S0.s01234567 + S0.s89abcdef;
S0.s0123 = S0.s0123 + S0.s4567;
S0.s01 = S0.s01 + S0.s23;
sum = S0.s0 + S0.s1;