python, matrix, sparse-matrix

From a text file to the Matrix Market format


I am working in Python and I have a matrix stored in a text file. The file has the following format:

row_id, col_id
row_id, col_id
...
row_id, col_id

row_id and col_id are integers ranging from 0 to n (to find n for row_id and for col_id, I have to scan the entire file first).

There is no header. Individual row_ids and col_ids appear multiple times in the file, but each row_id, col_id combination appears only once. No explicit value is stored for a combination: every listed cell has the value 1. The file is almost 1 gigabyte in size.

Unfortunately, the file is difficult to handle in memory: it has 2257205 row_ids and 122905 col_ids, for 26622704 elements. So I am looking for a better way to handle it. The Matrix Market format could be one way to deal with it.

Is there a fast and memory-efficient way to convert this file into a file in the Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?


Solution

  • There is a fast and memory-efficient way of handling such matrices: the sparse matrices offered by SciPy (the de facto standard in Python for this kind of thing).

    For a matrix of size N by N:

    from scipy.sparse import lil_matrix
    
    result = lil_matrix((N, N))  # In order to save memory, one may add: dtype=bool, or dtype=numpy.int8
    
    with open('matrix.csv') as input_file:
        for line in input_file:
            x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
            result[x, y] = 1
    

    (or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1; the tuple() call is needed because the index must be a (row, column) tuple, not a map object or a list).

    The argument 1 given to split() is only there to speed up parsing the coordinates: it tells Python to stop splitting after the first (and only) comma. The saving per line is small, but it adds up over a 1 GB file with millions of lines.

    Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
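
As one illustration of another representation, the coordinates can be collected first and handed to a COO matrix in a single call, then converted to CSR for arithmetic and row slicing. This is only a sketch under assumptions: the helper name load_pairs_as_csr is hypothetical, ids are taken to be 0-based, the shape is inferred as the largest id plus one, and the input is assumed to contain at least one pair. Compact array.array buffers are used instead of Python lists to keep the memory overhead per coordinate low.

```python
import io
from array import array

import numpy as np
from scipy.sparse import coo_matrix


def load_pairs_as_csr(lines):
    """Build a CSR matrix of ones from an iterable of 'row_id, col_id' lines."""
    rows = array('i')  # compact C-int buffers, much smaller than Python lists
    cols = array('i')
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        r, c = line.split(',', 1)
        rows.append(int(r))
        cols.append(int(c))
    row_idx = np.frombuffer(rows, dtype=np.intc)
    col_idx = np.frombuffer(cols, dtype=np.intc)
    data = np.ones(len(row_idx), dtype=np.int8)
    # Assumption: ids are 0-based, so each dimension is (largest id + 1).
    shape = (int(row_idx.max()) + 1, int(col_idx.max()) + 1)
    return coo_matrix((data, (row_idx, col_idx)), shape=shape).tocsr()


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO("0, 1\n2, 0\n2, 3\n")
matrix = load_pairs_as_csr(sample)
print(matrix.shape, matrix.nnz)
```

With the real file, the same function could be called as load_pairs_as_csr(open('matrix.csv')). Because each row_id, col_id pair appears only once in the input, no duplicate entries get summed during the conversion.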

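    If the matrix is small enough, a dense NumPy array (result = numpy.array(…)) is faster to index but uses far more memory. For the dimensions in the question it is not practical: 2257205 × 122905 is roughly 2.8 × 10^11 cells, or about 277 GB even at one byte per cell.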
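
If the goal is only to produce a Matrix Market file, the matrix does not need to be held in memory at all. The following is a minimal sketch of a two-pass, streaming conversion, not a definitive implementation. It assumes 0-based ids in the input, writes the "coordinate pattern general" variant of the format (entries carry no value, which matches a matrix whose listed cells are all 1), and uses hypothetical file names and a hypothetical helper name, pairs_to_matrix_market.

```python
import os
import tempfile


def pairs_to_matrix_market(src_path, dst_path):
    """Convert 'row_id, col_id' lines (0-based) to a Matrix Market pattern file."""
    # Pass 1: find the dimensions and the entry count without storing the entries.
    max_row = max_col = -1
    nnz = 0
    with open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            r, c = map(int, line.split(',', 1))
            max_row = max(max_row, r)
            max_col = max(max_col, c)
            nnz += 1

    # Pass 2: write the header, the size line, then one 1-based entry per line.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.write('%%MatrixMarket matrix coordinate pattern general\n')
        dst.write(f'{max_row + 1} {max_col + 1} {nnz}\n')
        for line in src:
            if not line.strip():
                continue
            r, c = map(int, line.split(',', 1))
            # Matrix Market indices start at 1, so shift the 0-based ids.
            dst.write(f'{r + 1} {c + 1}\n')
    return max_row + 1, max_col + 1, nnz


# Demonstration on a tiny temporary file (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'matrix.csv')
    dst_path = os.path.join(tmp, 'matrix.mtx')
    with open(src_path, 'w') as f:
        f.write('0, 1\n2, 0\n2, 3\n')
    dims = pairs_to_matrix_market(src_path, dst_path)
    with open(dst_path) as f:
        output = f.read()
print(output)
```

Reading the input twice costs extra I/O but keeps memory use constant, since only three counters are tracked. A file written this way should be loadable later with scipy.io.mmread; alternatively, if the sparse matrix is already in memory, scipy.io.mmwrite can write it out directly.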