Speed up nested loops and bitwise operations with Alea GPU

I'm trying to use Alea to speed up a program I'm working on but I need some help.

What I need to do is a lot of bitcount and bitwise operations with values stored in two arrays.

For each element of my first array I have to do a bitwise & operation with each element of my second array, then count the bits set to 1 of the & result.

If the result is greater than/equal to a certain value I need to exit the inner for and go to the next element of my first array.

The first array is usually a big one, with millions of elements, the second one is usually less than 200.000 elements.

Trying to do all these operations in parallel, here is my code:

[GpuManaged]
private long[] Check(long[] arr1, long[] arr2, int limit)
{
    Gpu.FreeAllImplicitMemory(true);
    var gpu = Gpu.Default;
    long[] result = new long[arr1.Length];
    gpu.For(0, arr1.Length, i =>
    {
        bool found = false;
        long b = arr1[i];
        for (int i2 = 0; i2 < arr2.Length; i2++)
        {
            if (LibDevice.__nv_popcll(b & arr2[i2]) >= limit)
            {
                found = true;
                break;
            }
        }
        if (!found)
        {
            result[i] = b;
        }
    });
    return result;
}

This works as expected but is just a little faster than my version running in parallel on a quad core CPU.

I'm certainly missing something here, it's my very first attempt to write GPU code.

By the way, my NVIDIA is a GeForce GT 740M.

EDIT

The following code is 2x faster than the previous one, at least on my PC. Many thanks to Michael Randall for pointing me in the right direction.

private static int[] CheckWithKernel(Gpu gpu, int[] arr1, int[] arr2, int limit)
{
    var lp = new LaunchParam(16, 256);
    var result = new int[arr1.Length];
    try
    {
        using (var dArr1 = gpu.AllocateDevice(arr1))
        using (var dArr2 = gpu.AllocateDevice(arr2))
        using (var dResult = gpu.AllocateDevice<int>(arr1.Length))
        {
            gpu.Launch(Kernel, lp, arr1.Length, arr2.Length, dArr1.Ptr, dArr2.Ptr, dResult.Ptr, limit);
            Gpu.Copy(dResult, result);
            return result;
        }
    }
    finally
    {
        Gpu.Free(arr1);
        Gpu.Free(arr2);
        Gpu.Free(result);
    }
}

private static void Kernel(int a1, int a2, deviceptr<int> arr1, deviceptr<int> arr2, deviceptr<int> arr3, int limit)
{
    var iinit = blockIdx.x * blockDim.x + threadIdx.x;
    var istep = gridDim.x * blockDim.x;
    for (var i = iinit; i < a1; i += istep)
    {
        bool found = false;
        int b = arr1[i];
        for (var j = 0; j < a2; j++)
        {
            if (LibDevice.__nv_popcll(b & arr2[j]) >= limit)
            {
                found = true;
                break;
            }
        }
        if (!found)
        {
            arr3[i] = b;
        }
    }
}

Solution

Update

It seems pinning won't work with GCHandle.Alloc()

However the point of this answer is you will get a much greater performance gain out of direct memory access.

http://www.aleagpu.com/release/3_0_3/doc/advanced_features_csharp.html

Directly Working with Device Memory

Device memory provides even more flexibility as it also allows all kinds of pointer arithmetics. Device memory is allocated with

Memory<T> Gpu.AllocateDevice<T>(int length)
Memory<T> Gpu.AllocateDevice<T>(T[] array)

The first overload creates a device memory object for the specified type T and length on the selected GPU. The second one allocates storage on the GPU and copies the .NET array into it. Both return a Memory<T> object, which implements IDisposable and can therefore support the using syntax which ensures proper disposal once the Memory<T> object goes out of scope. A Memory<T> object has properties to determine the length, the GPU or the device on which it lives. The Memory<T>.Ptr property returns a deviceptr<T>, which can be used in GPU code to access the actual data or to perform pointer arithmetics. The following example illustrates a simple use case of device pointers. The kernel only operates on part of the data, defined by an offset.

using (var dArg1 = gpu.AllocateDevice(arg1))
using (var dArg2 = gpu.AllocateDevice(arg2))
using (var dOutput = gpu.AllocateDevice<int>(Length/2))
{           
    // pointer arithmetics to access subset of data
    gpu.Launch(Kernel, lp, dOutput.Length, dOutput.Ptr, dArg1.Ptr + Length/2, dArg2.Ptr + Length / 2);

    var result = dOutput.ToArray();

    var expected = arg1.Skip(Length/2).Zip(arg2.Skip(Length/2), (x, y) => x + y);

    Assert.That(result, Is.EqualTo(expected));
}

Original Answer

Disregarding the logic going on, or how relevant this is to GPU code. However you could compliment your Parallel routine and possibly speed things up by by Pinning your Arrays in memory with GCHandle.Alloc() and the GCHandleType.Pinned flag and using Direct Pointer access (if you can run unsafe code)

Notes

You will cop a hit from pinning the memory, however for large arrays you can realize a lot of performance from direct access*
You will have to mark your assembly unsafe in Build Properties*
This is obviously untested and just an example*
You could used fixed, however the Parallel Lambda makes it fiddlier

Example

private unsafe long[] Check(long[] arr1, long[] arr2, int limit)
{   
   Gpu.FreeAllImplicitMemory(true);
   var gpu = Gpu.Default;    
   var result = new long[arr1.Length];

   // Create some pinned memory
   var resultHandle = GCHandle.Alloc(result, GCHandleType.Pinned);
   var arr2Handle = GCHandle.Alloc(result, GCHandleType.Pinned);
   var arr1Handle = GCHandle.Alloc(result, GCHandleType.Pinned);

   // Get the addresses
   var resultPtr = (int*)resultHandle.AddrOfPinnedObject().ToPointer();
   var arr2Ptr = (int*)arr2Handle.AddrOfPinnedObject().ToPointer();
   var arr1Ptr = (int*)arr2Handle.AddrOfPinnedObject().ToPointer();

   // I hate nasty lambda statements. I always find local methods easier to read.    
   void Workload(int i)
   {
      var found = false;    
      var b = *(arr1Ptr + i);

      for (var j = 0; j < arr2.Length; j++)
      {
         if (LibDevice.__nv_popcll(b & *(arr2Ptr + j)) >= limit)
         {
            found = true;
            break;
         }
      }

      if (!found)
      {
         *(resultPtr + i) = b;
      }
   }

   try
   {
      gpu.For(0, arr1.Length, i => Workload(i));
   }
   finally 
   {
      // Make sure we free resources
      arr1Handle.Free();
      arr2Handle.Free();
      resultHandle.Free();
   } 
   return result;    
}

Additional Resources

GCHandle.Alloc Method (Object)

A new GCHandle that protects the object from garbage collection. This GCHandle must be released with Free when it is no longer needed.

GCHandleType Enumeration

Pinned : This handle type is similar to Normal, but allows the address of the pinned object to be taken. This prevents the garbage collector from moving the object and hence undermines the efficiency of the garbage collector. Use the Free method to free the allocated handle as soon as possible.

Unsafe Code and Pointers (C# Programming Guide)

In the common language runtime (CLR), unsafe code is referred to as unverifiable code. Unsafe code in C# is not necessarily dangerous; it is just code whose safety cannot be verified by the CLR. The CLR will therefore only execute unsafe code if it is in a fully trusted assembly. If you use unsafe code, it is your responsibility to ensure that your code does not introduce security risks or pointer errors.

A note, there has since been an update, this:

http://www.aleagpu.com/release/3_0_3/doc/advanced_features_csharp.html

is now this:

http://www.aleagpu.com/release/3_0_4/doc/advanced_features_csharp.html

some of the samples and info have changed or moved in release 3.0.4.