I have a postgres database where I have a table listing all the nodes in a graph and a table listing all the edges between those nodes. What I want to be able to do is specify a subset of all the nodes and then retrieve all the edges between those nodes.
The edge table has two columns to indicate which nodes it joins: left_node
and right_node
. So the query for the edges has to return edges where both the left_node and the right node are in the set of nodes specified eg. if there are 100 nodes but I'm interested in the connection between 20 of them.
I created the query below that checks to see that each node is in the set. It works but does not scale; with 4.6M edges and a set of just 10 nodes, the query takes 3 minutes (macbook pro M1 32GB ram). I need to scale to +100,000 nodes
Is there a better way to do this?
-- edge query
edge_query = """
select *
from edges ed
where ed.left_node_id in ({})
and ed.right_node_id in ({})
"""
-- execute query in pandas (or your favourite dbase ide)
import pandas.io.sql as sqlio
edges_retrieve = sqlio.read_sql(edge_query.format(<text_nodes_list>),<connection>)
To improve performance,please load the set that goes into the IN clause (<text_nodes_list>) into a table. Also, create indexes on columns used in the where clause.