From a text file to the Matrix Market format

I am working in Python and I have a matrix stored in a text file. The text file is arranged in the following format:

row_id, col_id
row_id, col_id
...
row_id, col_id

row_id and col_id are integers and they take values from 0 to n (in order to know n for row_id and col_id I have to scan the entire file first).

There's no header. Individual row_ids and col_ids appear multiple times in the file, but each combination row_id,col_id appears only once. There's no explicit value for each combination row_id,col_id; each such cell simply has the value 1. The file is almost 1 gigabyte in size.

Unfortunately, the file is difficult to handle in memory: it contains 2257205 row_ids and 122905 col_ids, for 26622704 elements. So I was looking for better ways to handle it. The Matrix Market format could be one way to deal with it.

Is there a fast and memory-efficient way to convert this file into a file in the Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?

Solution

There is a fast and memory-efficient way of handling such matrices: using the sparse matrices offered by SciPy (which is the de facto standard in Python for this kind of thing).

For a matrix with N rows and M columns (that is, N = largest row_id + 1 and M = largest col_id + 1):

from scipy.sparse import lil_matrix

result = lil_matrix((N, M))  # To save memory, one may add dtype=bool or dtype=numpy.int8

with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed up the splitting
        result[x, y] = 1

(or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1; the tuple() call is needed in Python 3, where map() returns an iterator).

The argument 1 given to split() is only there to speed up the parsing of the coordinates: it instructs Python to stop splitting the line once the first (and only) comma is found. This can make a noticeable difference, since you are reading a 1 GB file.
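
If the only goal is the Matrix Market file itself, the same line-by-line parsing could also write it directly, without building any matrix in memory. The following is a minimal sketch, assuming the coordinate "pattern" variant of the format (entries without explicit values, 1-based indices, and a size line giving rows, columns and number of entries). The function name convert_to_matrix_market and the file paths are hypothetical:

```python
def convert_to_matrix_market(src_path, dst_path):
    """Stream 'row_id, col_id' lines into a Matrix Market coordinate file."""
    # Pass 1: find the matrix dimensions and the number of entries.
    max_row = max_col = -1
    nnz = 0
    with open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            row, col = map(int, line.split(',', 1))
            max_row = max(max_row, row)
            max_col = max(max_col, col)
            nnz += 1

    # Pass 2: write the header, the size line, then one entry per line.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.write('%%MatrixMarket matrix coordinate pattern general\n')
        dst.write('%d %d %d\n' % (max_row + 1, max_col + 1, nnz))
        for line in src:
            if not line.strip():
                continue
            row, col = map(int, line.split(',', 1))
            # Matrix Market indices are 1-based; the input ids are 0-based.
            dst.write('%d %d\n' % (row + 1, col + 1))

    return max_row + 1, max_col + 1, nnz
```

Calling convert_to_matrix_market('matrix.csv', 'matrix.mtx') reads the input twice but keeps only a few integers in memory, which is the trade-off this sketch makes.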

Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
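
As one illustration, the COO representation can be built directly from sequences of row and column indices, and scipy.io.mmwrite can then save a sparse matrix in Matrix Market format. This is a minimal sketch, assuming the indices have already been read into two sequences; pairs_to_coo is a hypothetical helper name:

```python
import numpy as np
from scipy.io import mmwrite
from scipy.sparse import coo_matrix


def pairs_to_coo(rows, cols):
    """Build a COO matrix holding a 1 at every (row, col) pair."""
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    data = np.ones(len(rows), dtype=np.int8)  # every stored cell is 1
    # Ids are 0-based, so each dimension is the largest id plus one.
    shape = (int(rows.max()) + 1, int(cols.max()) + 1)
    return coo_matrix((data, (rows, cols)), shape=shape)


matrix = pairs_to_coo([0, 2, 1], [1, 0, 1])
# mmwrite('matrix.mtx', matrix) would then write it as a Matrix Market file.
```

A COO matrix is cheap to construct but not meant for element-by-element updates; matrix.tocsr() converts it for arithmetic and row slicing.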

If you want a faster but also far more memory-consuming dense array, you can use result = numpy.array(…) (with NumPy) instead.
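
To see how large the memory gap is for the matrix in the question, a rough estimate can help. This sketch assumes one byte per cell for the dense array and, for the sparse case, two 32-bit indices plus one byte per stored element (a COO-like layout). Actual overhead depends on the sparse format and dtype chosen, so treat the figures as orders of magnitude only:

```python
# Figures taken from the question.
n_rows, n_cols, nnz = 2257205, 122905, 26622704

# Dense: every cell is stored, at one byte each (e.g. dtype=numpy.int8).
dense_bytes = n_rows * n_cols

# Sparse (assumed layout): int32 row + int32 col + int8 value per stored entry.
sparse_bytes = nnz * (4 + 4 + 1)

print('dense:  %.1f GB' % (dense_bytes / 1e9))
print('sparse: %.1f MB' % (sparse_bytes / 1e6))
```

Under these assumptions the dense array needs roughly 277 GB, against about 240 MB for the sparse layout, so a dense array is only realistic for much smaller matrices.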