sql, nosql, sparse-matrix, bigdata, database

How to store a sparse adjacency matrix


I've read several topics, but I'm lost. I'm quite new to this. I want to store a huge sparse matrix and have several ideas, but I can't choose between them. Here are my needs:

  1. An adjacency matrix of approx. 50 million vertices.
  2. Maximum number of neighbors per vertex - approx. 10 000.
  3. Average number of neighbors per vertex - approx. 200-300.
  4. Fast row queries - a vector will be multiplied by this matrix (see the sketch after this list).
  5. O(1) complexity to add an edge.
  6. Most probably, edges will not be deleted.
  7. Enumeration of the vertices adjacent to v - as fast as possible.
  8. Portability - there must be a way to transfer the database from one computer to another.
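
To be concrete about requirement 4, here is roughly what I mean by multiplying a vector by the matrix, assuming the matrix is stored as (row, col) pairs and the vector as (index, value) rows. This is only a sketch; every table and column name is made up.

```sql
-- Sketch only: requirement 4 over a hypothetical edge-pair layout.
CREATE TABLE edge_pairs (row_id integer, col_id integer);
CREATE TABLE vec (idx integer, val double precision);

-- For a 0/1 adjacency matrix A, the product (v^T A)[j] is the sum of v[i]
-- over all edges (i, j), i.e. a join followed by a group-by.
SELECT e.col_id AS idx, SUM(v.val) AS val
FROM vec v
JOIN edge_pairs e ON e.row_id = v.idx
GROUP BY e.col_id;
```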

So, here are my ideas:

  1. A huge table with pairs (row, col). Very simple, but enumerating a vertex's neighbors will take at least O(log N), where N is the size of the table, which seems quite slow to me. Also, the table must be indexed. Any RDBMS would do for this (see the sketch after this list).
  2. An enormous number of lists: one list per vertex. Very fast enumeration, but wouldn't it take a lot of resources to store this? Also, I'm not sure which DBMS to use in this case: maybe some NoSQL store?
  3. A huge table (row | set of cols) - a combination of the two above. I'm not sure whether any RDBMS supports arbitrary sets. Do you know of any? Maybe NoSQL would be useful here?
  4. A collection of adjacency lists. Any RDBMS would be suitable for that, and the complexity costs are good, but performance can be killed by making multiple requests to the DB for a single vertex.
  5. HDF5 - I think it will be slow due to I/O.
  6. Neo4j - as far as I understand, it stores data in doubly linked lists, so it would be practically the same as №4, am I right?
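
To make idea 1 concrete, using the same hypothetical edge_pairs table as in the sketch above, the operations I care about look roughly like this:

```sql
-- Continues the edge_pairs sketch above; all names are still arbitrary.
CREATE INDEX edge_pairs_row_idx ON edge_pairs (row_id);

-- Adding an edge is a single INSERT (plus an O(log N) index update).
INSERT INTO edge_pairs (row_id, col_id) VALUES (7, 42);

-- Enumerating the neighbors of vertex 7 goes through the B-tree index,
-- which is where the O(log N) lookup cost comes from.
SELECT col_id FROM edge_pairs WHERE row_id = 7;
```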

Please help me choose, or suggest a better solution.

If my estimates are wrong somewhere, please correct me.


Solution

  • In the end, I've implemented solution number one.

    I used PostgreSQL with two tables: one for edges, with two columns (start and end vertex), and another for vertices, with a unique serial as the vertex number and some columns for the vertex description.
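
    A minimal sketch of that layout (the column names here are assumptions, not necessarily what I used):

    ```sql
    -- Sketch of the two-table schema described above; names are assumed.
    CREATE TABLE vertices (
        id          serial PRIMARY KEY,  -- unique serial used as the vertex number
        description text                 -- plus whatever columns describe the vertex
    );

    CREATE TABLE edges (
        start_vertex integer NOT NULL REFERENCES vertices(id),
        end_vertex   integer NOT NULL REFERENCES vertices(id),
        PRIMARY KEY (start_vertex, end_vertex)
    );
    ```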

    I implemented an upsert based on pg_advisory_xact_lock. It was a bit slow, but it was enough for me.
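
    The upsert was along these lines (a sketch under the schema above, not my exact code): take a transaction-scoped advisory lock keyed on the edge, then insert only if the row is not already there.

    ```sql
    BEGIN;
    -- Serialize concurrent writers of this edge so the existence check and the
    -- insert are atomic; the lock is released automatically at COMMIT.
    SELECT pg_advisory_xact_lock(7, 42);   -- keying on (start, end) is an assumption
    INSERT INTO edges (start_vertex, end_vertex)
    SELECT 7, 42
    WHERE NOT EXISTS (
        SELECT 1 FROM edges WHERE start_vertex = 7 AND end_vertex = 42
    );
    COMMIT;
    ```

    (On PostgreSQL 9.5 or newer, INSERT ... ON CONFLICT DO NOTHING covers the same case without advisory locks.)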

    Also, it's a pain to delete a vertex in this configuration.

    To speed up multiplication, I exported the edges table to a file. It can even be held entirely in RAM on an x64 machine.
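
    The export was something like the following sketch; the path and CSV format are assumptions. COPY writes a server-side file (which needs the appropriate privileges), while psql's \copy writes to a client-side file.

    ```sql
    -- Dump the edge list to a flat file for the out-of-database multiplication.
    COPY edges (start_vertex, end_vertex) TO '/tmp/edges.csv' WITH (FORMAT csv);
    ```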

    To be fair, the amount of data was less than I expected. Instead of 50 million vertices with an average of 200-300 edges per vertex, there were only 7 million vertices and 160 million edges in total.