I am working in Python and I have a matrix stored in a text file. The text file is arranged in such a format:
row_id, col_id
row_id, col_id
...
row_id, col_id
row_id and col_id are integers taking values from 0 to n (to know n for row_id and col_id, I have to scan the entire file first).
There is no header. row_ids and col_ids appear multiple times in the file, but each combination row_id,col_id appears only once. There is no explicit value for each row_id,col_id pair: every cell value is 1. The file is almost 1 gigabyte in size.
Unfortunately the file is difficult to handle in memory: there are 2257205 row_ids and 122905 col_ids, for 26622704 elements. So I was looking for better ways to handle it; the Matrix Market format could be one.
Is there a fast and memory-efficient way to convert this file to Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?
There is a fast and memory-efficient way of handling such matrices: the sparse matrices offered by SciPy (the de facto standard in Python for this kind of thing).
For a matrix of size N by N:
```python
from scipy.sparse import lil_matrix

result = lil_matrix((N, N))  # To save memory, one may add dtype=bool or dtype=numpy.int8

with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
        result[x, y] = 1
```

(or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1; the tuple() call is needed in Python 3, where map() returns an iterator rather than a list).
The argument 1 given to split() is only there to speed up the parsing of the coordinates: it instructs Python to stop splitting the line at the first (and only) comma. This can matter when you are reading a 1 GB file.
Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
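For a one-shot build like this one, the COO representation in particular can be assembled directly from the coordinate lists, which avoids the per-element assignment cost of lil_matrix. A sketch, using a small in-memory sample in place of the real 'matrix.csv' file:

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical sample standing in for the real 1 GB file:
input_file = io.StringIO("0, 1\n1, 0\n1, 2\n")

rows, cols = [], []
for line in input_file:
    x, y = map(int, line.split(',', 1))
    rows.append(x)
    cols.append(y)

data = np.ones(len(rows), dtype=np.int8)   # every stored cell value is 1
result = coo_matrix((data, (rows, cols)))  # shape inferred from the largest indices
```

Since the shape is inferred from the largest indices seen, this also spares you the preliminary scan of the file to determine n (pass an explicit shape=(n_rows, n_cols) if you need extra empty rows or columns at the end).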
If you want a faster but also more memory-consuming array, you can use result = numpy.array(…) (with NumPy) instead.
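As for the Matrix Market file the question actually asks for: once you have any SciPy sparse matrix, scipy.io.mmwrite serializes it in that format directly. A sketch, with a tiny stand-in matrix in place of one built as above:

```python
from scipy.io import mmwrite
from scipy.sparse import coo_matrix

# Stand-in for the matrix built from the real file:
result = coo_matrix(([1, 1], ([0, 1], [1, 0])))

# Writes a coordinate-format .mtx file (header, dimensions, then one
# "row col value" entry per stored element, 1-based indices):
mmwrite('matrix.mtx', result)
```

The file can be read back with scipy.io.mmread, which returns a COO sparse matrix.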