I have two processes that work with data in the same table.
One process inserts about 20,000 records a day into the target table, one by one (pure ADO.NET).
The second process periodically (every 15 minutes) calls a stored procedure that:
Sometimes these two processes overlap (due to delays in data processing), and I suspect that if the first process inserts new records while the second process is somewhere between steps 1 and 2, those records will be marked 'ToBeCopied' without having gone through the duplicate sifting.
This means that the stored procedure is now returning some duplicates.
This is my theory, but in practice I have not been able to replicate it...
I am using LINQ to SQL to insert duplicates (40-50 or so a second), and while this is running I am manually calling the stored procedure and storing its results.
It appears that while the stored procedure is running, the inserting pauses, so that in the end no duplicates make it into the final result set.
I am wondering if LINQ to SQL or SQL Server has a default mechanism that prevents concurrency and pauses the inserting while the selecting or updating takes place.
What do you think?
EDIT 1:
The 'duplicates' are not identical rows. They are 'equivalent' given the business/logical entities these records represent. Each row has a unique primary key.
P.S. Selecting the result set takes place with NOLOCK. Trying to reproduce on SQL Server 2008. Problem is alleged to occur on SQL Server 2005.
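One way to verify whether the inserts are actually being blocked by the stored procedure (rather than LINQ to SQL pausing on its own) would be to watch the blocking DMVs while both processes run. A rough sketch, using only standard DMVs (nothing specific to my schema):
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0;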
What do I think?
Update
Consider this for instance:
A staging table with a structure like:
CREATE TABLE T1 (
id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
date DATETIME NOT NULL DEFAULT GETDATE(),
data1 INT NULL,
data2 INT NULL,
data3 INT NULL);
Process A
is doing inserts at leisure into this table. It does not do any validation, it just dumps raw records in:
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
...
Process B
is tasked with extracting this staging table and moving cleaned up data into a table T2. It has to remove duplicates, which by business rules means records with the same values in data1, data2 and data3. Within a set of duplicates, only the first record by date should be kept:
set transaction isolation level snapshot;
declare @maxid int;
begin transaction
-- Snap the current max (ID)
--
select @maxid = MAX(id) from T1;
-- Extract the cleaned rows into T2 using ROW_NUMBER() to
-- filter out duplicates
--
with cte as (
SELECT date, data1, data2, data3,
ROW_NUMBER() OVER
(PARTITION BY data1, data2, data3 ORDER BY date) as rn
FROM T1
WHERE id <= @maxid)
MERGE INTO T2
USING (
SELECT date, data1, data2, data3
FROM cte
WHERE rn = 1
) s ON s.data1 = T2.data1
AND s.data2 = T2.data2
AND s.data3 = T2.data3
WHEN NOT MATCHED BY TARGET
THEN INSERT (date, data1, data2, data3)
VALUES (s.date, s.data1, s.data2, s.data3);
-- Delete the processed rows up to @maxid
--
DELETE FROM T1
WHERE id <= @maxid;
COMMIT;
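Note that snapshot transactions will fail at run time unless snapshot isolation has first been enabled on the database. The database name below is just a placeholder:
ALTER DATABASE StagingDb SET ALLOW_SNAPSHOT_ISOLATION ON;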
Assuming Process A only inserts, this procedure would safely process the staging table and extract the cleaned, de-duplicated rows. Of course, this is just a skeleton; a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and keep the transaction log size under control via batching.
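For illustration only, a minimal sketch of that TRY/CATCH-plus-batching shape, reusing the tables above (the 5000-row batch size is an arbitrary assumption to be tuned for log growth):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

DECLARE @batch INT, @maxid INT, @msg NVARCHAR(2048);
SET @batch = 5000;   -- assumed batch size, tune to keep log growth in check

WHILE 1 = 1
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Cap this iteration at @batch staging rows
        SELECT @maxid = MAX(id)
        FROM (SELECT TOP (@batch) id FROM T1 ORDER BY id) AS t;

        IF @maxid IS NULL
        BEGIN
            COMMIT;
            BREAK;   -- staging table drained
        END;

        -- Same dedup-and-merge as above, limited to this batch
        WITH cte AS (
            SELECT date, data1, data2, data3,
                   ROW_NUMBER() OVER
                       (PARTITION BY data1, data2, data3 ORDER BY date) AS rn
            FROM T1
            WHERE id <= @maxid)
        MERGE INTO T2
        USING (
            SELECT date, data1, data2, data3
            FROM cte
            WHERE rn = 1
        ) s ON s.data1 = T2.data1
           AND s.data2 = T2.data2
           AND s.data3 = T2.data3
        WHEN NOT MATCHED BY TARGET
            THEN INSERT (date, data1, data2, data3)
            VALUES (s.date, s.data1, s.data2, s.data3);

        -- Remove the processed batch from staging
        DELETE FROM T1
        WHERE id <= @maxid;

        COMMIT;
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0
            ROLLBACK;
        SELECT @msg = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);   -- surface the error; a real job would also log it
        RETURN;
    END CATCH;
END;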