I recently asked a question about how to solve a problem in a T-SQL query, which led me to use a MERGE statement. This is, however, proving problematic, as its performance is horrible.
What I need to do is insert rows based on a result set and save away the ID of each inserted row along with the source data it resulted from (see related question; MERGE is used because the OUTPUT clause of a plain INSERT cannot reference source-table columns).
I ended up with a query like this.
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
;WITH person AS
(
SELECT top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
)
)
MERGE [security].[Identities] AS tar
USING person AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the CTE query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
SELECT * FROM @temp
Using this query I'm inserting all the rows and then saving them to a table variable, along with the source values, for later processing.
This works great on sub-10k rows, but the data set I'm running it against is close to 2 million rows. I ran this query for about an hour without it completing (on a juiced-up premium-tier Azure SQL database).
Question: How can I make this faster? Can I achieve the same result without MERGE?
At first glance, the MERGE does not seem to be the culprit for the degraded performance. The merge condition is always false (0 = 1), so insertion into [security].[Identities] is the only possible path forward.
How long does it take to insert 2 million rows into @temp, bypassing [security].[Identities] and the MERGE?
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
--is this fast?!?
INSERT INTO @temp(action, GlobalId, Personnumber, Firstname, LastName)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
);
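To put an actual number on "fast", a minimal timing wrapper around that probe (both techniques are standard T-SQL; nothing here is specific to this schema):
SET STATISTICS TIME ON; --per-statement CPU/elapsed times appear in the Messages tab
DECLARE @t0 DATETIME2 = SYSUTCDATETIME();
--run the INSERT above here
SELECT DATEDIFF(SECOND, @t0, SYSUTCDATETIME()) AS elapsed_seconds;
SET STATISTICS TIME OFF;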
To check (a diagnostic sketch follows this list):
What is the datatype of [temp].[RawRoles].Personnumber? Is Personnumber really nvarchar(100)? Do you need to store foreign characters in a person's number?
NCHAR is double the size of CHAR, so varchar/char could be a better choice if you only have alphanumerics (common Latin characters) or digits with leading zeros. If your requirements can be met with a numeric datatype, then int/bigint/decimal would be preferred.
Is there an index on [temp].[RawRoles].Personnumber? Without an index, the existence check has to sort [temp].[RawRoles].Personnumber or hash it, which is an added cost in resource throughput/DTUs. A clustered index on [temp].RawRoles could be the most beneficial, considering that the majority of temp.RawRoles rows will eventually be processed/inserted.
What is the datatype of [security].[Accounts].Account, and is there an index on that column? The two columns [security].[Accounts].Account and [temp].[RawRoles].Personnumber should be of the same datatype, and ideally both should be indexed. If [security].[Accounts] is the final destination of the processed [temp].[RawRoles], the table could hold millions of rows, and an index on the Account column is required for any future processing. The downside of the index is slower inserts, so if the 2 million rows are the very first bulk load, it is best not to have an index on Account while inserting the "bulk" into security.Accounts (and to create it afterwards).
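A minimal sketch for answering these questions straight from the catalog views (assuming the object/column names used above; sys.columns and sys.indexes are standard SQL Server views). The last query illustrates the nvarchar-vs-varchar size difference with DATALENGTH:
--datatypes and lengths of the two columns being compared
SELECT OBJECT_SCHEMA_NAME(c.object_id) AS [schema]
,OBJECT_NAME(c.object_id) AS [table]
,c.name AS [column]
,TYPE_NAME(c.user_type_id) AS [datatype]
,c.max_length --in bytes, so nvarchar(100) shows as 200
FROM sys.columns c
WHERE (c.object_id = OBJECT_ID(N'temp.RawRoles') AND c.name = N'Personnumber')
OR (c.object_id = OBJECT_ID(N'security.Accounts') AND c.name = N'Account');
--existing indexes on both tables (a HEAP row with a NULL name means no clustered index)
SELECT OBJECT_NAME(i.object_id) AS [table], i.name, i.type_desc, i.is_unique
FROM sys.indexes i
WHERE i.object_id IN (OBJECT_ID(N'temp.RawRoles'), OBJECT_ID(N'security.Accounts'));
--nvarchar stores 2 bytes per character, varchar 1 (the literal is a hypothetical Personnumber)
SELECT DATALENGTH(N'19790305-1234') AS nvarchar_bytes --26
,DATALENGTH('19790305-1234') AS varchar_bytes; --13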
To sum up:
--contemplate & decide whether a change of the Account datatype is needed (a datatype change can have many implications for applications using the db)
--change the datatype of Personnumber to the datatype of Account (security.Accounts)
ALTER TABLE temp.RawRoles ALTER COLUMN Personnumber "datatype of security.Accounts.Account" NOT NULL; -- any rows with an undefined Personnumber?
--clustered index on Personnumber
CREATE /*UNIQUE*/ CLUSTERED INDEX uclxPersonnumber ON temp.RawRoles(Personnumber); --unique preferred, if possible
--index on account (not needed[?] when security.Accounts is empty)
CREATE INDEX idxAccount ON [security].Accounts(Account);
--baseline, how fast can we do a straight forward insertion of 2 million rows?
DECLARE @tempbaseline AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100) --ignore this for now
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
INSERT INTO @tempbaseline([action], GlobalId, Personnumber, Firstname, LastName)
SELECT 'INSERT', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
--if the execution time (baseline) is acceptable, proceed with the merge code
--"merge with output into" should be only "slightly" slower than the baseline
--if the baseline is not acceptable (plain insertion takes too much time), then the merge is futile
/*
DECLARE @temp....
MERGE [security].[Identities] AS tar
USING
(
SELECT --top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
) AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the USING query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT 'INSERT' /** only insert is possible $action */, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
--drop the index on Account (the process will insert 2 million rows)
DROP INDEX idxAccount ON [security].Accounts; --review and recreate this index after the bulk of accounts is inserted
...your process
*/
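And if the single 2-million-row MERGE is still too heavy for the DTU/log budget even with the indexes in place, one further option (my addition, not something the answer above requires) is to run the same MERGE in batches. A minimal sketch, assuming the UNIQUE clustered index on temp.RawRoles(Personnumber) suggested earlier, so that keyset paging on Personnumber cannot skip or repeat rows; @batchsize is an arbitrary starting point to tune:
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100) PRIMARY KEY --keyed so the MAX() below stays cheap
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
DECLARE @batchsize INT = 100000;
DECLARE @last NVARCHAR(100) = N''; --empty string sorts before any Personnumber
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    MERGE [security].[Identities] AS tar
    USING
    (
        SELECT TOP (@batchsize)
            t.[Personnumber]
            ,t.[Firstname]
            ,t.[Lastname]
        FROM [temp].[RawRoles] t
        WHERE t.Personnumber > @last
        AND NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
        ORDER BY t.Personnumber
    ) AS src
    ON 0 = 1 --always insert, same trick as above
    WHEN NOT MATCHED THEN
        INSERT ([Created], [Updated]) VALUES (GETUTCDATE(), GETUTCDATE())
    OUTPUT 'INSERT', inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
    SET @rows = @@ROWCOUNT;
    SELECT @last = MAX([Personnumber]) FROM @temp; --advance the key window
END;
Each MERGE commits on its own in autocommit mode, which bounds the log growth and DTU pressure per statement; the trade-off is that a failure mid-run leaves the rows inserted so far in [security].[Identities].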