Search code examples
sqlpostgresqlquery-optimization

Speed up sql query where elements in two columns must belong to a common set


I have a postgres database where I have a table listing all the nodes in a graph and a table listing all the edges between those nodes. What I want to be able to do is specify a subset of all the nodes and then retrieve all the edges between those nodes.

The edge table has two columns to indicate which nodes it joins: left_node and right_node. So the query for the edges has to return edges where both the left_node and the right node are in the set of nodes specified eg. if there are 100 nodes but I'm interested in the connection between 20 of them.

I created the query below that checks to see that each node is in the set. It works but does not scale; with 4.6M edges and a set of just 10 nodes, the query takes 3 minutes (macbook pro M1 32GB ram). I need to scale to +100,000 nodes

Is there a better way to do this?

-- edge query
edge_query = """
    select *
    from edges ed
    where ed.left_node_id in ({})
    and ed.right_node_id in ({})
"""

-- execute query in pandas (or your favourite dbase ide)
import pandas.io.sql as sqlio

edges_retrieve = sqlio.read_sql(edge_query.format(<text_nodes_list>),<connection>)


Solution

  • To improve performance,please load the set that goes into the IN clause (<text_nodes_list>) into a table. Also, create indexes on columns used in the where clause.