I have an undirected edge list containing millions of edges. Simplified example of an undirected edge list for a 10x10 sparse adjacency matrix:
0 2
0 9
2 8
6 9
I want to convert the edge list into compressed sparse row (definition) format. This means read the edge list and writing to three arrays: Value (always "1" in my case), Column_Index and Row_Pointer.
Reading off the example edge list, I can easily reconstruct the 0th row: it has a "1" in the 2nd and 9th column. Then, the 1st row has no non-zero entries.
Problem
For the 2nd row, because the edges are undirected, I am suppose to have a "1" in column 0 and 8. But the "2 0" entry isn't present in the list. I suppose this information is already encoded in the "0 2" entry.
I could read off the partially constructed compressed sparse row arrays to see if the "2 0" entry exists but for a large edgelist containing millions of entries, this doesn't work.
Question
How do I resolve this? Or is my approach wrong?
You could scan the edge list, swapping indexes so that for each (i, j)
it is always true that i < j
. This you do in O(N).
You also need a sorted edge list, and this is O(N log N). Once you have the sorted edge list, you can store it in Symmetric-CSR format. When reading cell (y,x)
, if y > x
then you swap y
and x
. Then you read row_pointer[y]
and row_pointer[y+1]
, let them be Pa
and Pb
, and start scanning CSR[i]
for i between Pa
and Pb
; you exit if x
>= CSR[i]
(found or not found depending if = or >), or if i == Pb
(not found).
You could also generate a second edge list where j > i
, and sort it too. At this point, you can scan both edges at the same time, and generate a CSR list without need of symmetry.
j0 = j1 = N+1
# i-th row:
# we are scanning NodesIJ[ij] and NodesJI[ji].
If NodesIJ[ij][0] == i
j0 = NodesIJ[ij][1]
If NodesJI[ji][0] == i
j1 = NodesIJ[ji][1]
if (j0 < j1)
j = j0
j0 = N+1
ij++
else
if (j1 == N+1)
# Nothing more on this row, and j0 and j1 are both N+1.
i++;
continue
j = j1
j1 = N+1
ji++
# We may now store (i,j) in CSR
if (-1 == row_ind[i])
row_ind[i] = csr;
col_ind[csr++] = j
The algorithm above can be improved observing that for any given i
, if there exist p
and q
such that NodesIJ[p] = i
and NodesJI[q] = i
, it will always be NodesIJ[p][1] > NodesJI[q][1]
since the former list describes the upper right triangular and the latter describes the lower left. So we can scan NodesJI until NodesJI[p][0]
is i, and then go on to NodesJI[q]
.
We can also avoid always checking for row_ind
initialization noting that if csr
index does not change, then the row is empty and the corresponding value can be -1 (or N+1, or whatever "invalid" value we want), otherwise it has to be the previous value of csr
.
scsr = csr;
while NodesIJ[ij][0] == i
col_ind[csr++] = NodesIJ[ij++][1]
while NodesJI[ji][0] == i
col_ind[csr++] = NodesJI[ji++][1]
row_ind[i++] = (csr == scsr) ? -1 : scsr;
The above runs in O(N log N).
An alternative is to allocate the matrix, decode the edge list into the matrix and parse it into a CSR. This is O(N), but may require too much memory; for a list size of N, you may have up to N^2 (or (N/a)^2, a being the average number of connections) cells. A list of millions of edges might easily require tens of gigabytes of storage.