Search code examples
sql-serversql-updatehashbytes

UPDATE of hashed field from a source View to target Table runs indefinitely


I have source View which is being used to populate a target Table. I recently added 2 new columns to the View and thus included them into the existing HASHBYTES function. Here is what the new content_hash looks like in the View:

CONVERT(NVARCHAR(100), HASHBYTES('SHA2_512',
                CONCAT(ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[TRANSACTION_ID],'#')), 'NULL TRANSACTIONUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[COMPANY_ID],'#')), 'NULL PROJECTUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[DEPARTMENT_ID],'#')), 'NULL OPERATINGGROUPUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(d.[DEPARTMENT_ID],'#')), 'NULL SO_OPERATINGGROUPUID'), '|' --added per User Story #4626
                ,ISNULL(CONVERT(VARCHAR(50),cx.modifiedon), 'NULL IV_MODIFIEDON'), '|' --added per User Story #4652
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[OPENAIR_IS_PERSON_ID],'#')), 'NULL RESOURCEUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[SUBSIDIARY_ID],'#')), 'NULL LEGALENTITYUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[ACCOUNT_ID],'#')), 'NULL GLACCOUNTUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[CLASS_ID],'#')), 'NULL GLCLASSUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(c.[PARENT_ID],'#')), 'NULL CLIENTUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), a.[TYPE_NAME]), 'NULL GLACCOUNTTYPEUID'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(t.[TRANDATE],'YYYYMMDD')), 'NULL TRANDATE'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[AMOUNT],'#.############')), 'NULL AMOUNT'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[AMOUNT_LINKED],'#.############')), 'NULL AMOUNT'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[AMOUNT_PENDING],'#.############')), 'NULL AMOUNT'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), [NON_POSTING_LINE]), 'NULL NONPOSTINGLINE'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), tl.[OPENAIR_ITEM_DESCRIPTION]), 'NULL Desc'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[DATE_CREATED],'yyyyMMdd HH:MM:ss')), 'NULL TRANDATE'), '|'
                ,ISNULL(CONVERT(VARCHAR(30), FORMAT(tl.[DATE_LAST_MODIFIED_GMT],'yyyyMMdd HH:MM:ss')), 'NULL LAST_MODIFIED')
                ))) 
                as content_hash

The issue I'm having though is running an UPDATE that will set the target Table content_hash to the result of the content_hash from the View. The UPDATE just runs indefinitely and never finishes (I've let it go for up to 2hrs).

UPDATE tgt
    SET
        tgt.content_hash = src.content_hash
    FROM dbo.[gl_transaction_line] tgt
    INNER JOIN dbo.src_gl_transaction_line src 
        ON tgt.gl_transaction_line_uid = src.gl_transaction_line_uid

The resultset of the View is about 6.46M records. Running the full View in isolation takes about 11-12 mins which is how long it was taking before the 2 new fields were added.

Is there anything I can do to improve performance or an alternative way I can set the content_hash in the target Table to match the View?

This is just a one-time need where I need to update the content_hash across the entire table. It's because we don't want the stored procedure which normally loads the target table at night to see all of these content_hash differences and capture it in the history table. The target Table is normally temporal but I'm temporarily reverting it to non-temporal to make these (stealth) updates and then re-enabling it otherwise our history would double in size since every record is being touched by the addition of these two new fields.

EDIT I have limited permissions in the environment that I'm working in but will share what I can and post more if I can get it.

In general, we have stored procedures in our warehouse that execute as part of our nightly refresh. These Sp's compare the source Views to their target temporal Tables to determine whether an Insert, Update, Delete or nothing should happen. I'll share the beginning snippet of the related sp but this is why we're using the content_hash. If there's a better approach to all of this I'm all ears. I'm still relatively new to this environment so I can't speak to why things are architected the way they were but I'm glad to iterate towards a better solution.

    IF OBJECT_ID('tempdb.dbo.#processing_path') IS NOT NULL BEGIN DROP TABLE #processing_path END

SELECT          ISNULL(src.gl_transaction_line_uid, tgt.gl_transaction_line_uid) as gl_transaction_line_uid
                , CASE WHEN tgt.gl_transaction_line_uid IS NULL 
                        THEN 'Insert'
                    WHEN src.gl_transaction_line_uid IS NULL 
                        THEN 'Delete'
                    WHEN src.content_hash != tgt.content_hash
                        THEN 'Update'
                    ELSE 'Unchanged' END AS processing_path
                , tgt.version_begin_timestamp
INTO            #processing_path
FROM            dbo.src_gl_transaction_line src
 FULL JOIN      dbo.[gl_transaction_line] tgt ON src.gl_transaction_line_uid = tgt.gl_transaction_line_uid

CREATE UNIQUE CLUSTERED INDEX idx_uc ON #processing_path (gl_transaction_line_uid)

So depending on which processing_path that a record from the source View is assigned then that dictates whether those records are handled by the DELETE, UPDATE or INSERT queries further down in the sp.

EDIT 2 I was able to get a co-worker to capture the actual execution plan of the source View. I see there is an expensive clustered index scan (64% cost) and it's recommending a non-clustered index be created. I will try this tomorrow and see how that helps things.

https://www.brentozar.com/pastetheplan/?id=SyvbS1miT

Side Note: I did also remove all of the nested FORMAT functions inside the HASHBYTES function and even removed a couple of the CONVERT VARCHAR functions as they were unnecessary. Just doing this alone did improve the runtime performance of the View by a couple of minutes but the UPDATE attempt still ran indefinitely.


Solution

  • If the update (which uses the view) takes ages to run, I would suggest splitting it up into two separate processes for this 'one-off' run.

    Part 1: Get the data into a temporary table e.g.,

    CREATE TABLE #TempHashes (gl_transaction_line_uid int PRIMARY KEY, content_hash nvarchar(100));
    
    INSERT INTO #TempHashes (gl_transaction_line_uid, content_hash)
    SELECT gl_transaction_line_uid, content_hash
    FROM dbo.src_gl_transaction_line;
    

    Part 2: Then simply do an update of the original table, with the new hashes, joining on the ID, using the pre-calculated hashes.

    I find that this sort of strategy tends to work - I cannot truly explain why (there are people here who probably could) but I suspect that it has to do with how it is running the hashing view, and in particular, how many times it runs. The simpler 'just put the results into a temporary table' means it runs the hashing calculations as efficiently as possible; then the update using the temporary table is also very easy/efficient for the planner to create a good execution plan.

    NOTE: As per comments, there are many issues with this view. The above is as a way to get the job done in the current situation with minimal other changes - assuming the situation is that it typically "works" (no matter how inefficiently) but you need to do this one-off whole-table data update.