I've the following query:
INSERT INTO [Table A] ([student_name], [class_id],[contact_detail], [birth_date],[note_average])
SELECT [student_name] = case when CHARINDEX('.', [student_name])>0 then LEFT([student_name],CHARINDEX('.', [student_name])-1)
else [student_name] end
,[class_id]
,case when reverse(SUBSTRING(REVERSE([contact_detail]),1,CHARINDEX(':', REVERSE([contact_detail])))) like ':'
then ([contact_detail] + '|')
else [contact_detail]
end as [contact_detail]
,[birth_date]
,CAST([note_average] AS decimal(13,2)) as [note_average]
,GETDATE()
FROM [Table A]
WHERE CAST([birth_date] AS DATE) <= CAST(GETDATE() AS DATE)
AND LEN([student_name]) >= 5
AND NOT EXISTS
(
SELECT [student_name]
,[class_id]
,[contact_detail]
,[birth_date]
FROM [Table A] a
WHERE '%' + ods.[student_name] + '%' LIKE a.[student_name]
AND '%' + ods.[class_id] + '%' LIKE a.[class_id]
AND '%' + ods.[contact_detail] + '%' LIKE a.[contact_detail]
AND ods.[birth_date] = a.[birth_date]
)
GO
I don't want to insert duplicate values and I don't have an key in my table. My problem is: this query is taking a lot of time to insert the new values. I'm trying to insert 1000000 rows.
Which alternatives I've?
Many thanks!
Assuming you really want to match on exact duplicate matches, try something like this:
INSERT INTO [Table A] ([student_name], [os_name], [class_id],[contact_detail], [birth_date],[note_average])
SELECT ods.[student_name] = case when CHARINDEX('.', ods.[student_name])>0 then LEFT(ods.[student_name],CHARINDEX('.', ods.[student_name])-1)
else ods.[student_name] end
,ods.[class_id]
,case when reverse(SUBSTRING(REVERSE(ods.[contact_detail]),1,CHARINDEX(':', REVERSE(ods.[contact_detail])))) like ':'
then (ods.[contact_detail] + '|')
else ods.[contact_detail]
end as [contact_detail]
,ods.[birth_date]
,CAST(ods.[note_average] AS decimal(13,2)) as ods.[note_average]
,GETDATE()
FROM [Table A] ods
OUTER JOIN [Table A] a ON ods.[student_name] = a.[student_name]
AND ods.[os_name] = a.[os_name]
AND ods.[class_id] = a.[class_id]
AND ods.[contact_detail] = a.[contact_detail]
AND ods.[birth_date] = a.[birth_date]
WHERE CAST(ods.[birth_date] AS DATE) <= CAST(GETDATE() AS DATE)
AND LEN(ods.[student_name]) >= 5
-- Only include when no matching duplicate is found.
AND a.[student_name] IS NULL
You could stick with a sub-query and NOT EXISTS if you want and that would be fine as well but use SELECT 0 or similar instead of selecting all those columns, which isn't necessary. Changing the comparisons from LIKE to = will, I believe, give you the results you want along with making the query optimizable with indexes, if necessary.
AND NOT EXISTS
(
SELECT 0
FROM [Table A] a
WHERE ods.[student_name] = a.[student_name]
AND ods.[os_name] = a.[os_name]
AND ods.[class_id] = a.[class_id]
AND ods.[contact_detail] = a.[contact_detail]
AND ods.[birth_date] = a.[birth_date]
)
Once you have this query working correctly, then you can look at adding indexes if you need better performance. You might be able to just add an index on [birth_date] or [student_name] to get acceptable query performance.