Search code examples
sqlsql-serverstored-procedurespriority-queuequery-performance

Priority queue in SQL Server


I'm currently in the process of building a web crawler in C#. To queue the URLs which have yet to be crawled I use SQL Server. It works pretty fast, but it starts getting really large over time which slows down my stored procedures.

CREATE TABLE PriorityQueue
(
ID int IDENTITY(0,1) PRIMARY KEY,
absolute_url varchar (400),
depth int,
priorty int,
domain_host varchar (255),
);

CREATE INDEX queueItem ON PriorityQueue(absolute_url);
CREATE INDEX queueHost ON PriorityQueue(domain_host);

This is the table I use for my queue. The priority numbers from 1 to 5 with 1 being the highest priority. As you can see I also use indexes for my stored procedures down below.

Procedure for adding new items to the queue:

DROP PROCEDURE IF EXISTS dbo.Enqueue
GO
CREATE PROCEDURE dbo.Enqueue(@absolute_url varchar(255), @depth int, @priorty int, @host varchar(255))
AS
BEGIN
    INSERT INTO [WebshopCrawler].[dbo].[PriorityQueue] (absolute_url, depth, priorty, domain_host) VALUES (@absolute_url, @depth, @priorty, @host);
END
GO

Procedure for getting the item with the highest priority:

DROP PROCEDURE IF EXISTS dbo.Dequeue
GO
CREATE PROCEDURE dbo.Dequeue
AS
BEGIN
    SELECT top 1 absolute_url, depth, priorty
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE priorty = (SELECT MIN(priorty) FROM [WebshopCrawler].[dbo].[PriorityQueue])
END
GO

This one gets realy slow with larger data.

Procedure to delete the dequeued item:

DROP PROCEDURE IF EXISTS dbo.RemoveFromQueue
GO
CREATE PROCEDURE dbo.RemoveFromQueue(@absolute_url varchar(400))
AS
BEGIN
    DELETE 
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE absolute_url = @absolute_url
END
GO

I tried using lots of different indexes, but nothing seemed to make the procedures go any faster. I hope someone has an idea on how to improve this.


Solution

  • Please read Using tables as Queues. The important issues:

    • You must organize the table according to the dequeue strategy. Primary key in IDENTITY makes absolutely no sense. Use a clustered index based on priority and dequeue order.
    • You must dequeue atomically in a single statement, use DELETE ... OUTPUT ...

    So it should be something along these lines:

    CREATE TABLE PriorityQueue
    (
      priority int not null,
      enqueue_time datetime not null default GETUTCDATE(),
      absolute_url varchar (8000) not null,
      depth int not null,
      domain_host varchar (255) not null,
    );
    
    CREATE CLUSTERED INDEX PriorityQueueCdx on PriorityQueue(priority DESC, enqueue_time);
    
    CREATE PROCEDURE dbo.Dequeue
    AS
    BEGIN
        with cte as (
           SELECT top 1 absolute_url, depth, priority
           FROM [PriorityQueue] with (rowlock, readpast)
           ORDER BY priority DESC, enqueue_time)
         DELETE FROM cte
             OUTPUT DELETED.*;
    END
    GO