Search code examples
sql-servernodesedges

Efficiently query nodes and edges using SQL


I have 2 SQL Server tables that store network information, the EF schemas are:

public partial class edge
{
    public long edge_id { get; set; }
    public string source { get; set; }
    public string target { get; set; }
    public Nullable<System.DateTime> edgedate { get; set; }
}

public partial class node
{
    public string node_id { get; set; }
    public string name { get; set; }
    public string address { get; set; }
}

I am passing edge and node specific filters from the UI to be built into SQL queries like this:

select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  

select * 
from [dbo].[Edges] 
where edgedate >= '01/12/2020 00:00:00' 
  and edgedate <= '01/12/2021 23:59:59

However these queries have to account for the entire network, i.e. node filters have to be applied to edges and vice versa -

-- nodes example with edge filters applied
select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  
   and node_id in (select source 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59'
                   union 
                   select target 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59') 

This works fine on small scale networks, however if I am dealing with networks of say a million edges and 500k nodes the performance to run these queries takes a hit, because of the in statements when checking the other table in each instance.

I have added indexes on all the affiliated columns for the queries however need to know if there is a more efficient way of doing this?

Additional Info

Query plan - here

Clustered indexes are set on each primary key i.e. node_id and edge_id and non clustered set on the rest e.g. -

CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
    [name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Solution

  • Firstly, your query appears to have a logic error: there should be parenthesis around the or.

    Secondly, UNION ALL is usually better than UNION, although it often doesn't matter in a semi-join such as IN or EXISTS

    select n.* 
    from [dbo].[Nodes] n
    where (n.name = 'John Doe' 
           or n.address = '123 Fake Street')
     and node_id in (
        select source
        from EDGES 
        where edgedate >= '01/12/2020 00:00:00' 
          and edgedate <= '01/12/2021 23:59:59'
        union all
        select target 
        from EDGES
        where edgedate >= '01/12/2020 00:00:00' 
          and edgedate <= '01/12/2021 23:59:59'
    );
    

    Finally, for this query, you should probably have the following indexes

    NODES (name) INCLUDE (node_id)
    
    NODES (address) INCLUDE (node_id)
    
    EDGES (edgedate) INCLUDE (source, target)
    

    The or condition may still cause issues, as you may still get an index scan on NODES. If so, you may need to rewrite the query to force an index union instead.

    select n.* 
    from (
        select *
          from [dbo].[Nodes] n
          where n.name = 'John Doe'
        union
        select *
          from [dbo].[Nodes] n
          where n.address = '123 Fake Street'
    ) n
    where node_id in (
        select source
        from EDGES 
        where edgedate >= '01/12/2020 00:00:00' 
          and edgedate <= '01/12/2021 23:59:59'
        union all
        select target 
        from EDGES
        where edgedate >= '01/12/2020 00:00:00' 
          and edgedate <= '01/12/2021 23:59:59'
    );