matrix graph sparse-matrix matrix-multiplication adjacency-matrix

CSR x CSR Matrix Multiplication for Finding Out Cycles

I am trying to find out number of cycles in an undirected graph with specified length(k) containing vertex u for each vertex u in the graph. To do so I am trying to find out adjacency matrix's k'th power. I created CSR representation of the graph from the edge list. It is working really fast. But the CSR x CSR multiplication part is really slow, (it seems to be taking 50 min with an input size of 500k x 500k matrix). I am curious about a better solution. Is there a more efficient way to go since this is a adjacency matrix? Or Is there any better CSRxCSR matrix multiplication that I could look at? I could not find any CSR X CSR matrix multiplication example as an algorithm or c++ implementation.

    void multiply_matrix(std::vector<int> &adj, std::vector<int> &xadj, std::vector<int> &values, std::vector<int> &adj2, std::vector<int> &xadj2, std::vector<int> &values2, int size)
  {
          std::vector<int> result_adj;
          std::vector<int> result_xadj(size+1,0);
          std::vector<int> result_value(values.size(),0);
          for(int i = 0; i<size; i++)
          {
                  for(int j = 0; j<size; j++)
                  {
                          int result = 0;
                          int startIndex = xadj[i];
                          int endIndex = xadj[i+1];
                          for(int index = startIndex; index<endIndex; index++)
                          {
                                  int currentValRow = values[adj[index]];
                                  bool shouldContinue = false;
                                  for(int colIndex = xadj2[j]; colIndex<xadj2[j+1]; colIndex++)
                                  {
                                          if(adj[index] == adj2[colIndex])
                                          {
                                                  shouldContinue = true;
                                                  break;
                                          }
                                  }
                                  if(!shouldContinue)
                                          continue;
                                  int currentValCol = values2[adj2[index]];
                                  result += currentValCol*currentValRow;
                          }
                          if(result != 0)
                          {
                                  result_xadj[i+1]++;
                                  if(i+2 < result_xadj.size())
                                          result_xadj[i+2] = result_xadj[i+1];
                                  result_adj.push_back(j);
                                  result_value[j] = result;
                          }
                  }
          }
  }

Solution

I solved my problem and wanted to share with those who also lacks the required "terminology" to find out lots of resources on the topic. When you google "sparse matrix multiplication" it is hard to find sparse matrix x sparse matrix. Which is called SpGEMM. There are lots of informative papers about the process.

The pseudocode of the algorithm I used: General SpGEMM algorithm

I modified the algorithm a little bit to produce CSR output. The challange with that seems to be the allocation for result arrays to hold csr arrays (values, index_array, etc..). There are different methods used to solve that issue such as:

Allocating the arrays as big as the upper bound. Which may be a problem if your matrices are too big. If you decide to go this way you can look into: https://math.stackexchange.com/questions/1042096/bounds-of-sparse-matrix-multiplication.
Before allocating any memory for the result the multiplication operation can be done to determine the amount of non zeros in the result. Since there is no memory write operation exists in this space the result comes out really fast. So the memory required for the result arrays can be allocated after this "dummy run".
Allocating a pre-determined amount and when it is not sufficient allocating a new array and copying the content to the new, bigger array.

I implemented the function for both CPU (using OpenMP) and GPU (using CUDA). In the OpenMP approach I used a method similar to option 3 that i have listed. I used separate vectors for results of each row. Than I added the resulting vectors. The vector approach may be slower than doing the re-allocation operation manually but it was easier so I choose that way and it is fast enough (the test matrix had 500k row and 500k column the multiplication operation takes around 1.3 seconds using 60 threads on my test machine). For the GPU approach I used the option 2. At first I calculated the required amount then the actual operation happens.

Edit: Also this method finds out "walks" rather than paths. So there might be repeated vertices.