Tags: sql, sql-server, loops, sql-server-2016

Missing rows after INSERT with FETCH NEXT / OFFSET / OPTIMIZE FOR


Context

The script loads only some of the data, even though we expect it to load all of the data in one run (i.e., about 43 iterations of the loop), so I have to run it several times to get all of the records. I suspect it has something to do with the FETCH NEXT, OFFSET, and/or OPTIMIZE FOR clauses.

There are about 2 million records that need to be loaded from the CSV file. The code below is supposed to:

  1. extract about 500k rows at a time from the file via BULK INSERT into a #tmp table, then
  2. iterate about 43 times, translating and inserting roughly 50k records per iteration, skipping any that are already in the database.

Problem

It loads only some of the data in a complete run; I have to run this script several times to get all of the records.

I have some other code, removed here for brevity, that tracks the migration history, and it shows that each run catches more of the records and adds them to the database. Honestly, I'm at a loss as to why. I've tried logging and investigating and I'm stuck. Below you can see each run: the number of rows in the CSV as ChecksumMigration, the number of rows in the table when the run started as ChecksumTableBefore, and the number of rows in the table after the run as ChecksumTableAfter.

AppliedDateTime ChecksumMigration ChecksumTableBefore ChecksumTableAfter
2024-05-06 00:20:05 2036473 1986473 2036473
2024-05-06 00:06:27 2036473 1936473 1986473
2024-05-05 23:57:51 2036473 1786473 1936473
2024-05-05 23:54:07 2036473 1536473 1786473
2024-05-05 23:49:35 2036473 1036473 1536473
2024-05-05 23:42:20 2036473 0 1036473
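
For reference, a hypothetical sketch of the omitted tracking code, inferred from the columns above (the table name and details here are my assumptions, not the actual code):

CREATE TABLE dbo.MigrationHistory (
    AppliedDateTime     DATETIME2 NOT NULL,
    MigrationName       NVARCHAR(80) NOT NULL,
    ChecksumMigration   INT NOT NULL,  -- row count loaded from the CSV (#tmp)
    ChecksumTableBefore INT NOT NULL,  -- COUNT(*) in the target table before the run
    ChecksumTableAfter  INT NOT NULL   -- COUNT(*) in the target table after the run
);

-- At the end of the script below, after the loop (reusing its variables):
DECLARE @csvCount INT;
SELECT @csvCount = COUNT(*) FROM #tmp;
EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @afterChecksum OUTPUT;

INSERT INTO dbo.MigrationHistory
    (AppliedDateTime, MigrationName, ChecksumMigration, ChecksumTableBefore, ChecksumTableAfter)
VALUES
    (@startTime, @migrationName + @tableName, @csvCount, @beforeChecksum, @afterChecksum);
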
USE FoodData_Central;
DECLARE @beforeChecksum INT = 0;
DECLARE @afterChecksum INT = 0;
DECLARE @migrationName NVARCHAR(40) = N'2024 April Full from 2021 - ';  
DECLARE @pathToInputFolder NVARCHAR(40) = N'C:\FoodData_Central_csv_2024-04-18\'; -- unused below: BULK INSERT requires a literal path
DECLARE @tableName NVARCHAR(40);
DECLARE @startTime DATETIME2 = GETDATE();

BEGIN TRY  
    SET @tableName = 'food';
    
    -- CHECKSUM
    DECLARE @SQL NVARCHAR(MAX) = 'SELECT @ResultVariable = COUNT(*) FROM ' + @tableName;
    EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @beforeChecksum OUTPUT;

    -- TRUNCATE TABLE food; -- only truncated on the first run; afterwards, FETCH NEXT works through the data, fixing problem rows along the way
    -- Mapping
    DROP TABLE IF EXISTS #tmp;
    CREATE TABLE #tmp (
        fdc_id NVARCHAR(MAX) NOT NULL,
        data_type NVARCHAR(MAX) NULL,
        description NVARCHAR(MAX) NULL,
        food_category_id NVARCHAR(MAX) NULL,
        publication_date NVARCHAR(MAX) NULL
    );
    BULK INSERT #tmp
    FROM 'C:\FoodData_Central_csv_2024-04-18\food.csv' -- update file name (!sometimes the file names are different <crosses eyes>)
    WITH
        (
            CODEPAGE = '65001'
            ,FIRSTROW = 2
            ,FIELDTERMINATOR = '","'
            ,ROWTERMINATOR = '0x0A'   -- shifts control to the next row
            ,BATCHSIZE = 500000
            ,TABLOCK
        );

    DECLARE @i INT = 1;
    DECLARE @offsetCount INT = 1;
    DECLARE @nextCount INT = 50000;

    WHILE @i < 43
    BEGIN
        SET @i = @i + 1;
        -- DDL
        INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE table name and columns
            SELECT
                 CAST(REPLACE(t.fdc_id, '"', '') AS INT) AS fdc_id
                , t.data_type
                , t.description
                , CAST(t.food_category_id AS SMALLINT) AS food_category_id
                , CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
            FROM #tmp t
            WHERE NOT EXISTS ( -- skip duplicates
                SELECT 1 FROM food AS d -- UPDATE
                WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
            )
            ORDER BY fdc_id DESC
            OFFSET @offsetCount - 1 ROWS
            FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
            OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );

        SET @offsetCount = @offsetCount + 50000;
        SET @nextCount = @nextCount + 50000;
    END

    -- CLEANUP
    DROP TABLE IF EXISTS #tmp;
END TRY  
BEGIN CATCH
    PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    -- CLEANUP
    DROP TABLE IF EXISTS #tmp;
END CATCH; 
GO  

Edit:

Thanks to everyone who helped out with this, including @MatBailie's recommendations for performance.

I used @MatBailie's recommendation to create a clustered index on the temp table. That wasn't possible on the initial temp table (its NVARCHAR(MAX) columns can't be index key columns), so I created a second temp table with an index, moved the data there, and dropped the first one. It cut the run time in half, even with the additional subquery from this answer.
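
For example, the second-temp-table step might look something like this (a sketch; #tmp2, the column sizes, and the index name are my own illustrative choices):

CREATE TABLE #tmp2 (
    fdc_id           INT NOT NULL,
    data_type        NVARCHAR(100) NULL,   -- assumed size
    description      NVARCHAR(MAX) NULL,
    food_category_id SMALLINT NULL,
    publication_date DATETIME2 NULL
);

INSERT INTO #tmp2 (fdc_id, data_type, description, food_category_id, publication_date)
SELECT
    CAST(REPLACE(t.fdc_id, '"', '') AS INT),
    t.data_type,
    t.description,
    CAST(t.food_category_id AS SMALLINT),
    CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
FROM #tmp t;

CREATE CLUSTERED INDEX IX_tmp2_fdc_id ON #tmp2 (fdc_id);
DROP TABLE IF EXISTS #tmp; -- free the untyped staging table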


Solution

  • OK, this one is a bit weird, but I believe the OFFSET/FETCH is applied after the WHERE clause, which throws the offsets off.

    For example,

    • Round 1: you insert 50,000 rows (the WHERE finds nothing to exclude) with offset 0 and nextcount 50,000.
    • Round 2: the WHERE clause filters out the 50,000 rows you inserted in round 1 before the offset is applied. Your offset is 50,000 and nextcount 100,000, so instead of starting at record number 50,001 it starts at record number 100,001, and you skip 50,000 rows (which would have been records 50,001 to 100,000). A minimal repro of this drift follows the list.
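
    A minimal repro sketch of the drift, using throwaway #demo/#target tables and a window of 5 instead of 50,000:

    DROP TABLE IF EXISTS #demo, #target;
    CREATE TABLE #demo (id INT);
    CREATE TABLE #target (id INT);
    INSERT INTO #demo (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

    -- Round 1: the WHERE excludes nothing, so OFFSET 0 / FETCH 5 inserts ids 1-5.
    INSERT INTO #target (id)
    SELECT id FROM #demo d
    WHERE NOT EXISTS (SELECT 1 FROM #target x WHERE x.id = d.id)
    ORDER BY id
    OFFSET 0 ROWS FETCH NEXT 5 ROWS ONLY;

    -- Round 2: the WHERE now removes ids 1-5 BEFORE the offset is applied,
    -- so OFFSET 5 skips the remaining ids 6-10 and this inserts zero rows.
    INSERT INTO #target (id)
    SELECT id FROM #demo d
    WHERE NOT EXISTS (SELECT 1 FROM #target x WHERE x.id = d.id)
    ORDER BY id
    OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;

    SELECT COUNT(*) AS loaded FROM #target; -- 5, not 10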

    Instead, try putting the OFFSET/FETCH into a subquery or a separate query (e.g., select the next 50,000 records first), then do the INSERT INTO with the WHERE clause afterwards.

    I think something like this works

    INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE table name and columns
        SELECT fdc_id, data_type, description, food_category_id, publication_date
        FROM (SELECT CAST(REPLACE(t.fdc_id, '"', '') AS INT) AS fdc_id
                    , t.data_type
                    , t.description
                    , CAST(t.food_category_id AS SMALLINT) AS food_category_id
                    , CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
                FROM #tmp t
                ORDER BY fdc_id DESC
                OFFSET @offsetCount - 1 ROWS
                FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
            ) AS a
        WHERE NOT EXISTS ( -- skip duplicates
            SELECT 1 FROM food AS d -- UPDATE
            WHERE d.fdc_id = a.fdc_id
        )
        OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );
    

    As per the comments below, @MatBailie suggested I add this to the answer:

    • If you keep the WHERE clause, you only need to select the first 50,000 rows each time (e.g., SELECT TOP (50000)) rather than using OFFSET/FETCH, since each pass filters out the rows already inserted.
    • To eliminate the WHERE clause entirely, delete any duplicates from the temp table first, then just insert in plain batches.

    These are probably better approaches than mine above; rough sketches of both follow.
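
    Sketches of both ideas, assuming the same food/#tmp shapes as above (my own untested illustration, not @MatBailie's code):

    -- 1) Keep the WHERE clause but use TOP instead of OFFSET/FETCH: each pass
    --    filters out what was already inserted, so TOP always grabs the next batch.
    WHILE 1 = 1
    BEGIN
        INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date)
        SELECT TOP (50000)
            CAST(REPLACE(t.fdc_id, '"', '') AS INT),
            t.data_type,
            t.description,
            CAST(t.food_category_id AS SMALLINT),
            CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
        FROM #tmp t
        WHERE NOT EXISTS (
            SELECT 1 FROM food AS d
            WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
        )
        ORDER BY CAST(REPLACE(t.fdc_id, '"', '') AS INT) DESC;

        IF @@ROWCOUNT = 0 BREAK; -- nothing left to insert
    END

    -- 2) Or delete the duplicates from #tmp once up front, after which the
    --    batched inserts need no WHERE clause at all.
    DELETE t
    FROM #tmp t
    WHERE EXISTS (
        SELECT 1 FROM food AS d
        WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
    );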