Search code examples
csparse-matrix

Converting edgelist to compressed sparse row in C


I have an undirected edge list containing millions of edges. Simplified example of an undirected edge list for a 10x10 sparse adjacency matrix:

0 2
0 9
2 8
6 9

I want to convert the edge list into compressed sparse row (definition) format. This means read the edge list and writing to three arrays: Value (always "1" in my case), Column_Index and Row_Pointer.

Reading off the example edge list, I can easily reconstruct the 0th row: it has a "1" in the 2nd and 9th column. Then, the 1st row has no non-zero entries.

Problem

For the 2nd row, because the edges are undirected, I am suppose to have a "1" in column 0 and 8. But the "2 0" entry isn't present in the list. I suppose this information is already encoded in the "0 2" entry.

I could read off the partially constructed compressed sparse row arrays to see if the "2 0" entry exists but for a large edgelist containing millions of entries, this doesn't work.

Question

How do I resolve this? Or is my approach wrong?


Solution

  • You could scan the edge list, swapping indexes so that for each (i, j) it is always true that i < j. This you do in O(N).

    You also need a sorted edge list, and this is O(N log N). Once you have the sorted edge list, you can store it in Symmetric-CSR format. When reading cell (y,x), if y > x then you swap y and x. Then you read row_pointer[y] and row_pointer[y+1], let them be Pa and Pb, and start scanning CSR[i] for i between Pa and Pb; you exit if x >= CSR[i] (found or not found depending if = or >), or if i == Pb (not found).

    You could also generate a second edge list where j > i, and sort it too. At this point, you can scan both edges at the same time, and generate a CSR list without need of symmetry.

    j0 = j1 = N+1
    # i-th row:
        # we are scanning NodesIJ[ij] and NodesJI[ji].
        If NodesIJ[ij][0] == i
            j0 = NodesIJ[ij][1]
        If NodesJI[ji][0] == i
            j1 = NodesIJ[ji][1]
        if (j0 < j1)
            j = j0
            j0 = N+1
            ij++
        else
            if (j1 == N+1)
               # Nothing more on this row, and j0 and j1 are both N+1.
               i++;
               continue
            j = j1
            j1 = N+1
            ji++
        # We may now store (i,j) in CSR
        if (-1 == row_ind[i])
            row_ind[i]     = csr;
        col_ind[csr++] = j
    

    The algorithm above can be improved observing that for any given i, if there exist p and q such that NodesIJ[p] = i and NodesJI[q] = i, it will always be NodesIJ[p][1] > NodesJI[q][1] since the former list describes the upper right triangular and the latter describes the lower left. So we can scan NodesJI until NodesJI[p][0] is i, and then go on to NodesJI[q].

    We can also avoid always checking for row_ind initialization noting that if csr index does not change, then the row is empty and the corresponding value can be -1 (or N+1, or whatever "invalid" value we want), otherwise it has to be the previous value of csr.

        scsr = csr;
        while NodesIJ[ij][0] == i
            col_ind[csr++] = NodesIJ[ij++][1]
        while NodesJI[ji][0] == i
            col_ind[csr++] = NodesJI[ji++][1]
        row_ind[i++] = (csr == scsr) ? -1 : scsr;
    

    The above runs in O(N log N).

    An alternative is to allocate the matrix, decode the edge list into the matrix and parse it into a CSR. This is O(N), but may require too much memory; for a list size of N, you may have up to N^2 (or (N/a)^2, a being the average number of connections) cells. A list of millions of edges might easily require tens of gigabytes of storage.