
Concurrency - one process updates, another inserts


I have two processes that work with data in the same table.

One process inserts about 20000 records a day into the target table, one by one (pure ADO.NET).

The second process periodically (every 15 minutes) calls a stored procedure that:

  1. Detects duplicates among those 20000 records by looking at all the records from the past 7 days and marks them as such.
  2. Marks all records that are not duplicates with a 'ToBeCopied' flag.
  3. Selects a number of columns from the records marked 'ToBeCopied' and returns the set.
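
In outline, the procedure does something like the following (the table and column names here are made up for illustration; the real schema differs):

    -- Step 1: mark duplicates among the last 7 days of records
    UPDATE t
    SET t.Flag = 'Duplicate'
    FROM TargetTable t
    WHERE t.InsertedAt >= DATEADD(DAY, -7, GETDATE())
      AND EXISTS (SELECT 1 FROM TargetTable d
                  WHERE d.BusinessKey = t.BusinessKey
                    AND d.Id < t.Id);

    -- Step 2: mark whatever is still unflagged as ready to copy
    UPDATE TargetTable
    SET Flag = 'ToBeCopied'
    WHERE Flag IS NULL
      AND InsertedAt >= DATEADD(DAY, -7, GETDATE());

    -- Step 3: return the marked records
    SELECT Id, Col1, Col2
    FROM TargetTable
    WHERE Flag = 'ToBeCopied';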

Sometimes these two processes overlap (due to delays in data processing), and I suspect that if the first process inserts new records while the second process is somewhere between steps 1 and 2, those records will be marked 'ToBeCopied' without having gone through the duplicate sifting.

This means that the stored procedure is now returning some duplicates.

This is my theory, but in practice I have not been able to replicate it...

I am using LINQ to SQL to insert duplicates (40-50 or so a second), and while this is running I am manually calling the stored procedure and storing its results.

It appears that while the stored procedure is running, the inserting pauses, such that at the end no duplicates have made it into the final result set.

I am wondering whether LINQ to SQL or SQL Server has a default mechanism that prevents concurrent access and pauses the inserting while the selecting or updating takes place.

What do you think?

EDIT 1:

The 'duplicates' are not identical rows. They are 'equivalent' given the business/logical entities these records represent. Each row has a unique primary key.

P.S. The result set is selected with NOLOCK. I am trying to reproduce this on SQL Server 2008; the problem is alleged to occur on SQL Server 2005.


Solution

  • What do I think?

    • Why do you have duplicates in the database? Data purity begins in the client, at the application drawing board, which should have a data model that simply does not allow duplicates.
    • Why do you have duplicates in the database? Check constraints should prevent this from happening if the client app misbehaves.
    • If you have duplicates, the reader must be prepared to handle them.
    • You cannot detect duplicates in two stages (look, then mark); it has to be one single atomic mark. In fact, almost nothing in a database can be done in two 'look and mark' stages. All 'look for records, then mark the records found' processes fail under concurrency.
    • NOLOCK will give you inconsistent reads: records will be missing or read twice. Use SNAPSHOT isolation instead (see the snippet after this list).
    • LINQ to SQL has no pixie dust to replace bad design.
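
    SNAPSHOT isolation has to be enabled at the database level before a session can use it. A minimal sketch (the database, table, and column names are placeholders):

     -- One-time setup: allow snapshot isolation in the database
     ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

     -- Reader session: sees a consistent point-in-time view of the data,
     -- without blocking writers or being blocked by them
     SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
     BEGIN TRANSACTION;
     SELECT Id, Col1, Col2 FROM TargetTable WHERE Flag = 'ToBeCopied';
     COMMIT;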

    Update

    Consider this for instance:

    A staging table with a structure like:

     CREATE TABLE T1 (
      id INT NOT NULL IDENTITY(1,1) PRIMARY KEY, 
      date DATETIME NOT NULL DEFAULT GETDATE(), 
      data1 INT NULL, 
      data2 INT NULL, 
      data3 INT NULL);
    

    Process A is doing inserts at leisure into this table. It does not do any validation; it just dumps raw records in:

    INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
    INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
    INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
    ...
    INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
    INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
    ...
    INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
    ...
    

    Process B is tasked with extracting this staging table and moving cleaned up data into a table T2. It has to remove duplicates that, by business rules, mean records with same values in data1, data2 and data3. Within a set of duplicates, only the first record by date should be kept:

     set transaction isolation level snapshot;
    declare @maxid int;
    
    begin transaction
    -- Snap the current max (ID) 
    --
    select @maxid = MAX(id) from T1;
    
    -- Extract the cleaned rows into T2 using ROW_NUMBER() to
    -- filter out duplicates
    --
    with cte as (
      SELECT date, data1, data2, data3,
       ROW_NUMBER() OVER 
          (PARTITION BY data1, data2, data3 ORDER BY date) as rn
     FROM T1
     WHERE id <= @maxid)
    MERGE INTO T2 
    USING (
      SELECT date, data1, data2, data3
      FROM cte
      WHERE rn = 1
    ) s ON s.data1 = T2.data1
      AND s.data2 = T2.data2
      AND s.data3 = T2.data3
    WHEN NOT MATCHED BY TARGET
      THEN INSERT (date, data1, data2, data3)
      VALUES (s.date, s.data1, s.data2, s.data3);
    
     -- Delete the processed rows up to @maxid
    --
    DELETE FROM T1 
      WHERE id <= @maxid;
    COMMIT;
    

    Assuming Process A only ever inserts, this procedure safely processes the staging table and extracts the cleaned, deduplicated rows. Of course, this is just a skeleton; a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and transaction log size control via batching.
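
    For illustration, such a skeleton could look like this (a sketch only: the batch size of 10000 is an arbitrary assumption, and since the question mentions SQL Server 2005/2008 the error is re-raised with the pre-2012 RAISERROR pattern rather than THROW):

     DECLARE @maxid INT, @msg NVARCHAR(2048);

     BEGIN TRY
       -- Process the staging table in capped batches, so each
       -- transaction stays small and the log can truncate between them
       WHILE 1 = 1
       BEGIN
         SELECT @maxid = MAX(id)
         FROM (SELECT TOP (10000) id FROM T1 ORDER BY id) AS batch;
         IF @maxid IS NULL BREAK;  -- staging table drained

         BEGIN TRANSACTION;
         -- ... the MERGE and DELETE from the skeleton above,
         --     both bounded by id <= @maxid ...
         COMMIT;
       END
     END TRY
     BEGIN CATCH
       IF @@TRANCOUNT > 0 ROLLBACK;
       SELECT @msg = ERROR_MESSAGE();
       RAISERROR(@msg, 16, 1);  -- re-raise for the caller
     END CATCH

    Duplicates that straddle a batch boundary are still filtered out, because the MERGE checks every candidate row against what is already in T2.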