Converting edgelist to compressed sparse row in C

I have an undirected edge list containing millions of edges. Simplified example of an undirected edge list for a 10x10 sparse adjacency matrix:

I want to convert the edge list into compressed sparse row (definition) format. This means read the edge list and writing to three arrays: Value (always "1" in my case), Column_Index and Row_Pointer.

Reading off the example edge list, I can easily reconstruct the 0th row: it has a "1" in the 2nd and 9th column. Then, the 1st row has no non-zero entries.

Problem

For the 2nd row, because the edges are undirected, I am suppose to have a "1" in column 0 and 8. But the "2 0" entry isn't present in the list. I suppose this information is already encoded in the "0 2" entry.

I could read off the partially constructed compressed sparse row arrays to see if the "2 0" entry exists but for a large edgelist containing millions of entries, this doesn't work.

Question

How do I resolve this? Or is my approach wrong?

Solution

You could scan the edge list, swapping indexes so that for each (i, j) it is always true that i < j. This you do in O(N).

You also need a sorted edge list, and this is O(N log N). Once you have the sorted edge list, you can store it in Symmetric-CSR format. When reading cell (y,x), if y > x then you swap y and x. Then you read row_pointer[y] and row_pointer[y+1], let them be Pa and Pb, and start scanning CSR[i] for i between Pa and Pb; you exit if x >= CSR[i] (found or not found depending if = or >), or if i == Pb (not found).

You could also generate a second edge list where j > i, and sort it too. At this point, you can scan both edges at the same time, and generate a CSR list without need of symmetry.

j0 = j1 = N+1
# i-th row:
    # we are scanning NodesIJ[ij] and NodesJI[ji].
    If NodesIJ[ij][0] == i
        j0 = NodesIJ[ij][1]
    If NodesJI[ji][0] == i
        j1 = NodesIJ[ji][1]
    if (j0 < j1)
        j = j0
        j0 = N+1
        ij++
    else
        if (j1 == N+1)
           # Nothing more on this row, and j0 and j1 are both N+1.
           i++;
           continue
        j = j1
        j1 = N+1
        ji++
    # We may now store (i,j) in CSR
    if (-1 == row_ind[i])
        row_ind[i]     = csr;
    col_ind[csr++] = j

The algorithm above can be improved observing that for any given i, if there exist p and q such that NodesIJ[p] = i and NodesJI[q] = i, it will always be NodesIJ[p][1] > NodesJI[q][1] since the former list describes the upper right triangular and the latter describes the lower left. So we can scan NodesJI until NodesJI[p][0] is i, and then go on to NodesJI[q].

We can also avoid always checking for row_ind initialization noting that if csr index does not change, then the row is empty and the corresponding value can be -1 (or N+1, or whatever "invalid" value we want), otherwise it has to be the previous value of csr.

    scsr = csr;
    while NodesIJ[ij][0] == i
        col_ind[csr++] = NodesIJ[ij++][1]
    while NodesJI[ji][0] == i
        col_ind[csr++] = NodesJI[ji++][1]
    row_ind[i++] = (csr == scsr) ? -1 : scsr;

The above runs in O(N log N).

An alternative is to allocate the matrix, decode the edge list into the matrix and parse it into a CSR. This is O(N), but may require too much memory; for a list size of N, you may have up to N^2 (or (N/a)^2, a being the average number of connections) cells. A list of millions of edges might easily require tens of gigabytes of storage.