I am trying to optimize a T-SQL procedure in SQL Server 2017. I do not have direct access to the server, so I cannot change any settings and I can not enable FullTextSearch.
There exist two tables, one with a list of keywords (varchar(50)
) and a number code (int
), and another with large freely entered texts (varchar(max)
) and an ID (int
). I want to mark all these texts by which keywords are contained in another table. With a cursor, it works like this:
CREATE PROCEDURE markTexts
AS
BEGIN
DECLARE @code INT;
DECLARE @keyword VARCHAR(50);
DECLARE kcodes CURSOR LOCAL FAST_FORWARD FOR
SELECT code, keyword
FROM KeywordCode
ORDER BY code ASC;
OPEN kcodes;
FETCH NEXT FROM kcodes INTO @code, @keyword;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO markedTexts (textid, keywordcode)
SELECT t.id, @code AS keywordcode
FROM fullTexts t
WHERE CHARINDEX(@keyword, t.fullText) > 0;
FETCH NEXT FROM kcodes INTO @code, @keyword;
END;
CLOSE kcodes;
DEALLOCATE kcodes;
END;
The table KeywordCode
has ~200 rows, fullTexts
has ~50k rows. Running the procedure yields ~20k results and takes roughly 30 seconds, which is too slow (at least I want it to go faster). Since I cannot use FullTextSearch (which would probably be faster with an index and all that...), I thought maybe a set-approach would be faster than the iterative cursor-approach. That would look like so:
INSERT INTO markedTexts (textid, keywordcode)
SELECT t.id, k.code
FROM fullTexts t
INNER JOIN KeywordCode k ON CHARINDEX(k.keyword, t.fullText) > 0;
It looks a lot simpler and cleaner and it does yield the same result... but it takes around 7 minutes (~3.5 times as long).
So my question is: why is the JOIN method so much slower than the cursor and is there any obvious way to speed this up?
First, you don't need that cursor, I think a simple join will do. Cursors are miserably slow and generally not necessary. Let's focus on the SELECT portion of your insert since that is the other problem. First some sample data:
-- Sample Data
CREATE TABLE dbo.KeywordCode(code INT, keyword VARCHAR(100));
CREATE TABLE dbo.fullTexts(id INT PRIMARY KEY, FullTxt VARCHAR(500));
INSERT dbo.KeywordCode
VALUES(1,'yada yada'),(2,'blah'),(3,'abc'),(4,'xxx a');
INSERT dbo.fullTexts
VALUES(10,'...yada yada yada'),(11,'xxx'),(12,'xxx abc123'),(24,'blah blah');
1. Solving the cursor issue
This should get you the same results much faster than using a cursor:
-- INSERT INTO markedTexts (textid, keywordcode)
SELECT t.id, k.keyword
FROM dbo.KeywordCode AS k
JOIN dbo.fullTexts AS t
ON CHARINDEX(k.keyword,t.FullTxt) > 0;
2. Solving the SARGability issue
Using CHARINDEX in a filter, such as a WHERE or JOIN clause, make the query non-SARGable which means the execution engine will be unable to perform a seek against an available index, resulting in a full table scan each time the query runs. Note the execution plan from the above SELECT query:
Though we've solved the cursor issue, we still have to avoid this scan. If possible, I suggest an indexed view.
CREATE VIEW dbo.kw_txt WITH SCHEMABINDING AS
SELECT t.id, k.keyword
FROM dbo.KeywordCode AS k
JOIN dbo.fullTexts AS t ON CHARINDEX(k.keyword,t.FullTxt) > 0;
Now, with the indexed view in place you will get this plan instead, which has the values pre-joined for you.
What's special about this approach is how you will get this improved plan, which leverages the indexed view, without even referencing the view in the query. Note that both of these queries will leverage the index on our new view:
SELECT t.id, t.keyword FROM dbo.kw_txt AS t;
As well as:
SELECT t.id, k.keyword
FROM dbo.KeywordCode AS k
JOIN dbo.fullTexts AS t ON CHARINDEX(k.keyword,t.FullTxt) > 0;
There are other, more sophisticated approaches to this type of issue but this is the approach I would go with based on the details you provided.