
Using both GPU device of CUDA and zero copy pinned memory


I am using the CUSP library for sparse matrix multiplication on a CUDA machine. My current code is:

#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/transpose.h>
#include <stdio.h>
#define CATAGORY_PER_SCAN 1000
#define TOTAL_CATAGORY 100000
#define MAX_SIZE 1000000
#define ELEMENTS_PER_CATAGORY 10000 
#define ELEMENTS_PER_TEST_CATAGORY 1000
#define INPUT_VECTOR 1000
#define TOTAL_ELEMENTS (ELEMENTS_PER_CATAGORY * CATAGORY_PER_SCAN)
#define TOTAL_TEST_ELEMENTS (ELEMENTS_PER_TEST_CATAGORY * INPUT_VECTOR)
int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cusp::coo_matrix<long long int, double, cusp::host_memory> A(CATAGORY_PER_SCAN,MAX_SIZE,TOTAL_ELEMENTS);
    cusp::coo_matrix<long long int, double, cusp::host_memory> B(MAX_SIZE,INPUT_VECTOR,TOTAL_TEST_ELEMENTS);

    for(int i=0; i< ELEMENTS_PER_TEST_CATAGORY;i++){    
        for(int j = 0;j< INPUT_VECTOR ; j++){
            int index = i * INPUT_VECTOR + j ;
            B.row_indices[index] = i; B.column_indices[ index ] = j; B.values[index ] = i;
        }    
    }
    for(int i = 0;i < CATAGORY_PER_SCAN;  i++){
        for(int j=0; j< ELEMENTS_PER_CATAGORY;j++){     
            int index = i * ELEMENTS_PER_CATAGORY + j ;
            A.row_indices[index] = i; A.column_indices[ index ] = j; A.values[index ] = i;
        }    
    }
    /*cusp::print(A);
    cusp::print(B); */
    //test vector
    cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
    cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;

    // allocate output vector
    cusp::coo_matrix<int, double, cusp::device_memory>  y_d(CATAGORY_PER_SCAN, INPUT_VECTOR ,CATAGORY_PER_SCAN * INPUT_VECTOR);
    cusp::multiply(A_d, B_d, y_d);
    cusp::coo_matrix<int, double, cusp::host_memory> y=y_d;
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    printf("time elaplsed %f ms\n",elapsedTime);
    return 0;
}

As far as I understand, the cusp::multiply function uses only one GPU.

  1. How can I use cudaSetDevice() to run the same program on both GPUs (one cusp::multiply per GPU)?
  2. How can I measure the total time accurately?
  3. How can I use zero-copy pinned memory with this library, since I cannot do the allocation (malloc) myself?

Solution

  • 1. How can I use cudaSetDevice() to run the same program on both GPUs?

    If you mean "How can I perform a single cusp::multiply operation using two GPUs", the answer is you can't.


    EDIT:

    For the case where you want to run two separate CUSP sparse matrix-matrix products on different GPUs, it is possible to simply wrap the operation in a loop and call cudaSetDevice before the transfers and the cusp::multiply call; a sketch of such a loop is given below. You will probably not, however, get any speedup by doing so. I think I am correct in saying that both the memory transfers and the cusp::multiply operations are blocking calls, so the host CPU will stall until they are finished. Because of this, the calls for different GPUs cannot overlap, and there will be no speedup over performing the same operation on a single GPU twice. If you were willing to use a multithreaded application and have a host CPU with multiple cores, you could probably still run them in parallel, but it won't be as straightforward host code as it seems you are hoping for.
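
    A minimal sketch of what that loop might look like (untested; it assumes two or more GPUs, host matrices built as in the question, and a hypothetical helper name multiply_on_each_gpu):

    // Untested sketch: one cusp::multiply per GPU, selected with cudaSetDevice.
    // The transfers and the multiply are blocking, so the products run one
    // after the other rather than overlapping.
    #include <cusp/coo_matrix.h>
    #include <cusp/multiply.h>
    #include <cuda_runtime.h>

    void multiply_on_each_gpu(
        const cusp::coo_matrix<int, double, cusp::host_memory>& A,
        const cusp::coo_matrix<int, double, cusp::host_memory>& B,
        int num_gpus)
    {
        for (int dev = 0; dev < num_gpus; dev++)
        {
            cudaSetDevice(dev); // subsequent allocations and kernels target this GPU

            // copy the operands to the current device
            cusp::coo_matrix<int, double, cusp::device_memory> A_d = A;
            cusp::coo_matrix<int, double, cusp::device_memory> B_d = B;

            // compute the sparse-sparse product on this device
            cusp::coo_matrix<int, double, cusp::device_memory> y_d;
            cusp::multiply(A_d, B_d, y_d);

            // copy the result back to the host
            cusp::coo_matrix<int, double, cusp::host_memory> y = y_d;
            // ... do something with y here ...
        }
    }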

    2. Measure the total time accurately

    The CUDA event approach you have now is the most accurate way of measuring the execution time of a single kernel. If you had a hypothetical multi-GPU scheme, then the sum of the events from each GPU context would be the total execution time of the kernels. If, by total time, you mean the "wallclock" time to complete the operation, then you would need to use a host timer around the whole multi-GPU segment of your code; a sketch of that approach is shown below. I vaguely recall that it might be possible in the latest versions of CUDA to synchronize between events in streams from different contexts in some circumstances, so a CUDA event based timer might still be usable in such a scenario.
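
    A sketch of the host-timer option (untested; it assumes a C++11 compiler, and time_multigpu_segment is just an illustrative name):

    // Wallclock timing of a multi-GPU segment with a host timer. The per-device
    // cudaDeviceSynchronize calls make sure every GPU has finished its work
    // before the clock is stopped.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    void time_multigpu_segment(int num_gpus)
    {
        auto t0 = std::chrono::high_resolution_clock::now();

        // ... issue the transfers and cusp::multiply calls for each GPU here ...

        for (int dev = 0; dev < num_gpus; dev++)
        {
            cudaSetDevice(dev);
            cudaDeviceSynchronize(); // wait for this GPU to finish
        }

        auto t1 = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("wallclock time %f ms\n", ms);
    }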

    3. How can I use zero-copy pinned memory with this library, as I cannot do the allocation (malloc) myself?

    To the best of my knowledge that isn't possible. The underlying Thrust library that CUSP uses can support containers backed by zero-copy memory, but CUSP doesn't expose the necessary mechanisms in its standard matrix constructors to allocate a CUSP sparse matrix in zero-copy memory.
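
    For reference, this is how zero-copy (mapped, pinned) memory is normally obtained with the plain CUDA runtime API (untested sketch; allocate_zero_copy is just an illustrative name). The point above is that CUSP's constructors give you no supported way to make a cusp::coo_matrix use such an allocation.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Allocate n doubles of pinned, device-mapped ("zero copy") host memory.
    // Returns the host pointer; *device_alias receives the device-side alias
    // of the same memory.
    double* allocate_zero_copy(size_t n, double** device_alias)
    {
        cudaSetDeviceFlags(cudaDeviceMapHost); // must be set before the CUDA context is created

        double* host_ptr = 0;
        cudaHostAlloc((void**)&host_ptr, n * sizeof(double), cudaHostAllocMapped);

        cudaHostGetDevicePointer((void**)device_alias, host_ptr, 0);
        return host_ptr;
    }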