Tags: sql, sql-server, loops, sql-server-2016

Missing rows after INSERT with FETCH NEXT / OFFSET / OPTIMIZE FOR


Context

The script loads only some of the data, even though we expect it to load all of the data in one run (i.e., about 43 iterations of the loop), so I have to run it several times to get all of the records. I suspect it has something to do with the FETCH NEXT, OFFSET, and/or OPTIMIZE FOR clauses.

There are about 2 million records that need to be loaded from the CSV file. The code below is supposed to:

  1. extract about 500k rows at a time from the file via BULK INSERT into a #tmp table, then
  2. iterate about 43 times, translating and inserting roughly 50k records per iteration, skipping any that are already in the database.

Problem

It loads only some of the data in a complete run; I have to run this script several times to get all of the records.

I have some other code, removed here for brevity, that tracks the migration history, and it shows that each run catches more of the records and adds them to the database. Honestly, I'm at a loss as to why. I've tried logging and investigating and I'm stuck. Below you can see each run: the number of rows in the CSV as ChecksumMigration, the number of rows in the table when the run started as ChecksumTableBefore, and the number of rows in the table after the run as ChecksumTableAfter.

AppliedDateTime ChecksumMigration ChecksumTableBefore ChecksumTableAfter
2024-05-06 00:20:05 2036473 1986473 2036473
2024-05-06 00:06:27 2036473 1936473 1986473
2024-05-05 23:57:51 2036473 1786473 1936473
2024-05-05 23:54:07 2036473 1536473 1786473
2024-05-05 23:49:35 2036473 1036473 1536473
2024-05-05 23:42:20 2036473 0 1036473
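
For reference, a hypothetical sketch of the omitted tracking code, inferred from the columns above (the table name and details here are my assumptions, not the actual code):

CREATE TABLE dbo.MigrationHistory (
    AppliedDateTime     DATETIME2 NOT NULL,
    MigrationName       NVARCHAR(80) NOT NULL,
    ChecksumMigration   INT NOT NULL,  -- row count loaded from the CSV (#tmp)
    ChecksumTableBefore INT NOT NULL,  -- COUNT(*) in the target table before the run
    ChecksumTableAfter  INT NOT NULL   -- COUNT(*) in the target table after the run
);

-- At the end of the script below, after the loop (reusing its variables):
DECLARE @csvCount INT;
SELECT @csvCount = COUNT(*) FROM #tmp;
EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @afterChecksum OUTPUT;

INSERT INTO dbo.MigrationHistory
    (AppliedDateTime, MigrationName, ChecksumMigration, ChecksumTableBefore, ChecksumTableAfter)
VALUES
    (@startTime, @migrationName + @tableName, @csvCount, @beforeChecksum, @afterChecksum);
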
USE FoodData_Central;
DECLARE @beforeChecksum INT = 0;
DECLARE @afterChecksum INT = 0;
DECLARE @migrationName NVARCHAR(40) = N'2024 April Full from 2021 - ';  
DECLARE @pathToInputFolder NVARCHAR(40) = N'C:\FoodData_Central_csv_2024-04-18\'; -- unused below: BULK INSERT requires a literal path
DECLARE @tableName NVARCHAR(40);
DECLARE @startTime DATETIME2 = GETDATE();

BEGIN TRY  
    SET @tableName = 'food';
    
    -- CHECKSUM
    DECLARE @SQL NVARCHAR(MAX) = 'SELECT @ResultVariable = COUNT(*) FROM ' + @tableName;
    EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @beforeChecksum OUTPUT;

    -- TRUNCATE TABLE food; -- only truncated on the first run; afterwards, FETCH NEXT works through the data, fixing problem rows along the way
    -- Mapping
    DROP TABLE IF EXISTS #tmp;
    CREATE TABLE #tmp (
        fdc_id NVARCHAR(MAX) NOT NULL,
        data_type NVARCHAR(MAX) NULL,
        description NVARCHAR(MAX) NULL,
        food_category_id NVARCHAR(MAX) NULL,
        publication_date NVARCHAR(MAX) NULL
    );
    BULK INSERT #tmp
    FROM 'C:\FoodData_Central_csv_2024-04-18\food.csv' -- update file name (!sometimes the file names are different <crosses eyes>)
    WITH
        (
            CODEPAGE = '65001'
            ,FIRSTROW = 2
            ,FIELDTERMINATOR = '","'
            ,ROWTERMINATOR = '0x0A'   -- shifts control to the next row
            ,BATCHSIZE = 500000
            ,TABLOCK
        );

    DECLARE @i INT = 1;
    DECLARE @offsetCount INT = 1;
    DECLARE @nextCount INT = 50000;

    WHILE @i < 43
    BEGIN
        SET @i = @i + 1;
        -- DDL
        INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE table name and columns
            SELECT
                 CAST(REPLACE(t.fdc_id, '"', '') AS INT) AS fdc_id
                , t.data_type
                , t.description
                , CAST(t.food_category_id AS SMALLINT) AS food_category_id
                , CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
            FROM #tmp t
            WHERE NOT EXISTS ( -- skip duplicates
                SELECT 1 FROM food AS d -- UPDATE
                WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
            )
            ORDER BY fdc_id DESC
            OFFSET @offsetCount - 1 ROWS
            FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
            OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );

        SET @offsetCount = @offsetCount + 50000;
        SET @nextCount = @nextCount + 50000;
    END

    -- CLEANUP
    DROP TABLE IF EXISTS #tmp;
END TRY  
BEGIN CATCH
    PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    -- CLEANUP
    DROP TABLE IF EXISTS #tmp;
END CATCH; 
GO  

Edit:

Thanks to everyone who helped out with this, including @MatBailie's recommendations for performance.

I used @MatBailie's recommendation to create a clustered index on the temp table. That wasn't possible on the initial temp table (its NVARCHAR(MAX) columns can't be index key columns), so I created a second temp table with an index, moved the data there, and dropped the first one. It cut the run time in half, even with the additional subquery from this answer.
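
For example, the second-temp-table step might look something like this (a sketch; #tmp2, the column sizes, and the index name are my own illustrative choices):

CREATE TABLE #tmp2 (
    fdc_id           INT NOT NULL,
    data_type        NVARCHAR(100) NULL,   -- assumed size
    description      NVARCHAR(MAX) NULL,
    food_category_id SMALLINT NULL,
    publication_date DATETIME2 NULL
);

INSERT INTO #tmp2 (fdc_id, data_type, description, food_category_id, publication_date)
SELECT
    CAST(REPLACE(t.fdc_id, '"', '') AS INT),
    t.data_type,
    t.description,
    CAST(t.food_category_id AS SMALLINT),
    CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
FROM #tmp t;

CREATE CLUSTERED INDEX IX_tmp2_fdc_id ON #tmp2 (fdc_id);
DROP TABLE IF EXISTS #tmp; -- free the untyped staging table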


Solution

  • OK, this one is a bit weird, but I believe the OFFSET/FETCH is applied after the WHERE clause, which throws the offsets off.

    For example,

    • Round 1: you insert 50,000 rows (the WHERE finds nothing to exclude) with offset 0 and nextcount 50,000.
    • Round 2: the WHERE clause filters out the 50,000 rows you inserted in round 1 before the offset is applied. Your offset is 50,000 and nextcount 100,000, so instead of starting at record number 50,001 it starts at record number 100,001, and you skip 50,000 rows (which would have been records 50,001 to 100,000). A minimal repro of this drift follows the list.
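
    A minimal repro sketch of the drift, using throwaway #demo/#target tables and a window of 5 instead of 50,000:

    DROP TABLE IF EXISTS #demo, #target;
    CREATE TABLE #demo (id INT);
    CREATE TABLE #target (id INT);
    INSERT INTO #demo (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

    -- Round 1: the WHERE excludes nothing, so OFFSET 0 / FETCH 5 inserts ids 1-5.
    INSERT INTO #target (id)
    SELECT id FROM #demo d
    WHERE NOT EXISTS (SELECT 1 FROM #target x WHERE x.id = d.id)
    ORDER BY id
    OFFSET 0 ROWS FETCH NEXT 5 ROWS ONLY;

    -- Round 2: the WHERE now removes ids 1-5 BEFORE the offset is applied,
    -- so OFFSET 5 skips the remaining ids 6-10 and this inserts zero rows.
    INSERT INTO #target (id)
    SELECT id FROM #demo d
    WHERE NOT EXISTS (SELECT 1 FROM #target x WHERE x.id = d.id)
    ORDER BY id
    OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;

    SELECT COUNT(*) AS loaded FROM #target; -- 5, not 10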

    Instead, try putting the OFFSET/FETCH into a subquery or a separate query (e.g., select the next 50,000 records first), then do the INSERT INTO with the WHERE clause afterwards.

    I think something like this works

    INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE table name and columns
        SELECT fdc_id, data_type, description, food_category_id, publication_date
        FROM (SELECT CAST(REPLACE(t.fdc_id, '"', '') AS INT) AS fdc_id
                    , t.data_type
                    , t.description
                    , CAST(t.food_category_id AS SMALLINT) AS food_category_id
                    , CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
                FROM #tmp t
                ORDER BY fdc_id DESC
                OFFSET @offsetCount - 1 ROWS
                FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
            ) AS a
        WHERE NOT EXISTS ( -- skip duplicates
            SELECT 1 FROM food AS d -- UPDATE
            WHERE d.fdc_id = a.fdc_id
        )
        OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );
    

    As per the comments below, @MatBailie suggested I add this to the answer:

    • If you keep the WHERE clause, you only need to select the first 50,000 rows each time (e.g., SELECT TOP (50000)) rather than using OFFSET/FETCH, since each pass filters out the rows already inserted.
    • To eliminate the WHERE clause entirely, delete any duplicates from the temp table first, then just insert in plain batches.

    These are probably better approaches than mine above; rough sketches of both follow.
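
    Sketches of both ideas, assuming the same food/#tmp shapes as above (my own untested illustration, not @MatBailie's code):

    -- 1) Keep the WHERE clause but use TOP instead of OFFSET/FETCH: each pass
    --    filters out what was already inserted, so TOP always grabs the next batch.
    WHILE 1 = 1
    BEGIN
        INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date)
        SELECT TOP (50000)
            CAST(REPLACE(t.fdc_id, '"', '') AS INT),
            t.data_type,
            t.description,
            CAST(t.food_category_id AS SMALLINT),
            CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
        FROM #tmp t
        WHERE NOT EXISTS (
            SELECT 1 FROM food AS d
            WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
        )
        ORDER BY CAST(REPLACE(t.fdc_id, '"', '') AS INT) DESC;

        IF @@ROWCOUNT = 0 BREAK; -- nothing left to insert
    END

    -- 2) Or delete the duplicates from #tmp once up front, after which the
    --    batched inserts need no WHERE clause at all.
    DELETE t
    FROM #tmp t
    WHERE EXISTS (
        SELECT 1 FROM food AS d
        WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
    );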