
Performance of symmetric sparse matrix of dimension 5 000 000: Save to Database or File?


I have a huge dataset (around 5 000 000 rows in a database) which I want to represent as a graph. For algorithmic reasons it is required to store the dataset in an adjacency matrix. The matrix will be very sparse and symmetric.

First I thought of storing the graph in a database table. This would require 5 000 000 rows, which should be no problem. But 5 000 000 columns? I don't know much about databases, but I have the feeling that this is not a recommended way of doing it.

After some searching on Google, I found SciPy, which offers several sparse matrix objects. lil_matrix and coo_matrix seem to be what I need.

Since I will operate on this matrix using Python, SciPy seems a good way to go. The question for me now is how to store the graph, i.e. the sparse matrix?
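For reference, a minimal sketch of building a symmetric graph adjacency matrix with coo_matrix (the edge list here is made up for illustration; in practice it would be fetched from the database):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical edge list: (row, col) pairs of connected nodes.
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
vals = np.ones(len(rows))

n = 5  # would be ~5 000 000 for the real dataset

# Mirror each entry so the matrix comes out symmetric.
m = coo_matrix(
    (np.concatenate([vals, vals]),
     (np.concatenate([rows, cols]),
      np.concatenate([cols, rows]))),
    shape=(n, n))

print(m.nnz)  # 6 stored entries, regardless of n
```

Only the non-zero entries are stored, so memory grows with the number of edges, not with n².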

Should I use a CSV file? Should I use coo_matrix to save the matrix into a database table? Storing every entry either way would mean around 25 000 000 000 000 (5 000 000²) rows/lines.

Or is there a far better way of creating and storing such a symmetric, sparse matrix of dimension around 5 000 000 in Python?

I am using NumPy and some self-written algorithms in Python, which I want to run on the matrix. So it would be cool if the suggestions keep the graph easy to work with from Python.
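As a side note on running hand-written algorithms: a sketch of converting to CSR format, which supports the fast row access and matrix-vector products most graph algorithms need (toy data, standing in for the 5 000 000-dimension matrix):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy symmetric matrix built from a mirrored edge list.
rows = np.array([0, 0, 1])
cols = np.array([1, 2, 2])
vals = np.array([1.0, 2.0, 3.0])
m = coo_matrix(
    (np.r_[vals, vals], (np.r_[rows, cols], np.r_[cols, rows])),
    shape=(3, 3)).tocsr()

# CSR is efficient for the operations typical algorithms use:
degree = m.getnnz(axis=1)    # non-zeros per row, i.e. neighbours per node
v = m.dot(np.ones(3))        # sparse matrix-vector product
```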

I don't know if I provided enough information for an answer. If you need more information, feel free to ask me in a comment. I will gladly edit my question.

Thanks in advance for any suggestion!


Solution

  • You can use the SciPy sparse matrix formats. But all of your questions depend on the number of non-zero entries (NNZ) in the matrix: storage and most computations scale (approximately) with the NNZ only, not with the full dimension. Start here.
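    To illustrate the point, a sketch of persisting a sparse matrix with SciPy's native .npz serialization instead of a database table or CSV (here writing to an in-memory buffer; on disk you would pass a filename). The stored size scales with the NNZ, not with n²:

    ```python
    import io
    from scipy import sparse

    # A random sparse matrix: 1000 x 1000 with ~0.1% non-zeros.
    m = sparse.random(1000, 1000, density=0.001, format='csr', random_state=0)

    buf = io.BytesIO()
    sparse.save_npz(buf, m)   # stores only indices + data of non-zeros
    buf.seek(0)
    m2 = sparse.load_npz(buf)

    print(m2.nnz == m.nnz)    # round-trip preserves the entries
    ```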