Search code examples
asp.net.nethangfire

Why does Hangfire wait for 15s every few seconds when polling sql server for jobs?


I’ve inherited a system that uses Hangfire with sql server job storage. Usually when a job is scheduled to be run immediately we notice it takes a few seconds before it’s triggered.

Looking at SQL Profiler when running in my dev environment, the SQL run against Hangfire db looks like this -

exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800

-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then: 

exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'

-- Then nothing is executed for about 8-10 seconds, then: 

exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'

-- Then about 500ms-2s later I get 
exec sp_executesql N'delete top (1) JQ ... -- i.e. Same as first query
The update LastHeartbeat query is only there every second time (from just a brief inspection, maybe that’s not exactly right).

It looks like there’s at least 3 threads running the DELETE query against JQ, since I can see several RPC:Starting before the RPC:Completed, suggesting they’re being executed in parallel instead of sequentially. I don’t know if that’s normal but seems weird as I thought we had just one ‘consumer’ of the jobs.

I only have one Queue in my dev environment, although in live we’d have 20-50 I’d guess.

Any suggestions on where I should look for the configuration that’s causing: a) the 8-10s pause between checking for jobs b) the number of threads that are checking for jobs - it seems like I have too many


After writing this I realised we were using an old version so I upgraded from 1.5.x to 1.7.12, upgraded the database, and changed the startup config to this:

        app.UseHangfireDashboard();

        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true, // Default value: true
                EnableHeavyMigrations = true     // Default value: false
            })
            .UseAutofacActivator(_container);
        JobActivator.Current = new AutofacJobActivator(_container);

but if anything the problem is now worse. Or the same but faster: 20 calls to delete top (1) JQ... happen within about 1s now, then the other queries, then a 15s wait, then it starts all over again.

To be clear, the main problem is that if any jobs are added during that 15s delay then it'll take the remainder of that 15s before my job is executed. A second problem I think is it's hitting SQL Server more than needed: 20 times in a second is a bit much, for my needs at least.

(Cross-posted to hangfire forums)


Solution

  • If you don't set QueuePollInterval then Hangfire with sql server storage defaults to polling every 15s. So the first thing to do if you have this problem is set QueuePollInterval to something smaller, e.g. 1s.

    But in my case even when I set that it wasn't having any effect. The reason for that was calling app.UseHangfireServer() before I was calling GlobalConfiguration.Configuration.UseSqlServerStorage() with the SqlServerStorageOptions.

    When you call app.UseHangfireServer() it uses the current value of JobStorage.Current. My code had set that:

        var storage = new SqlServerStorage(connstring);
        JobStorage.Current = storage;
    

    then later called

        app.UseHangfireServer()
    

    then later called

            GlobalConfiguration.Configuration
                .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true, 
                EnableHeavyMigrations = true     
            })
    

    Reordering it to use SqlServerStorageOptions before app.UseHangfireServer() means the SqlServerStorageOptions take effect.