Search code examples
sqlsql-serverssisetldata-warehouse

When to use user defined functions in a SQL Server data warehouse


I am working on creating a DWH where I am loading the data in Staging DB and before loading them into final DB I apply all the udfs that I have created on the data.

  • Source DB : Oracle
  • Dest DB : SQL Server
  • ETL Process : SSIS packages

I was not processing anything on staging to have a quick load.

Question: is it quicker to apply any udfs when the data is in staging itself or should it be done when loading the data to final DB.

Below facility_cd is a float value and I am passing it to a function emr_get_code_Description to get the corresponding description. The table where it's getting the description from is in the final DB. udf_replace_special_char is a simple function which is replacing a few special characters with NULL.

LTRIM(RTRIM([Dest_DWH].[dbo].udf_replace_special_char([Dest_DWH].[dbo].[emr_get_code_Description](Stg_ap.Facility_cd))))

In general what should be a better practice? Should I be updating this in staging and then load the data after all conversions to Final DB.

Function definitions :

Function 1 :

USE [PROD_DWH]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER function [dbo].[emr_get_code_Description](@cv int)  
returns varchar(80)  
as begin   

-- Returns the code value display 
    declare @ret varchar(80)  
    select @ret = cv.DESCRIPTION
        from PROD_DWH.DBO.table cv   
        where cv.code_value = @cv   
            and cv.active_ind = 1  

    return isnull(@ret, 0)

end;

Function 2 :

USE [PROD_DWH]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER function [dbo].[udf_replace_special_char](@var varchar(1000))  
returns varchar(1000)  
as begin   
-- Returns the code value display 
    declare @return_var varchar(1000)  
    set @return_var = @var
    set @return_var = replace(@return_var,CHAR(13),'')
    set @return_var = replace(@return_var,CHAR(10),'')
    set @return_var = replace(@return_var,CHAR(09),'')
    set @return_var = replace(@return_var,CHAR(34),CHAR(39))

    return isnull(@return_var, 0)

end;

Solution

  • First of all, as @Nick.McDermaid mentioned in the comments: Best practice is to avoid using User defined functions. There are many links containing information about the functions effects on query performance.


    There is not ideal answer for these question, it is related to the case you are working with, but i can give some tips that you can take into consideration:

    • First, if you are using SSIS to import data into Staging Table, try replacing user defined function with the SSIS data flow components such as derived column transformation, Lookups, in a way that can enhance the performance of the data import.
    • If you cannot replace the UDF by SSIS components: If you are collecting data in high speed to a data lake (staging level) and then loading the data when needed, it is better to avoid using functions when importing data to staging table.
    • If You need to guarantee a high speed when loading data from staging table, then use the function in the first data import phase.
    • If the first data import phase (to staging table) and the second phase (from staging table) are not executed on the same machine, it could be better to execute functions on the more performant machine.
    • If function contains some operations like lookups, try replacing them with joins.

    ...

    Update 1

    After posting functions in your question, you can replace function 2 with a Derived Column Transformation in your SSIS package:

    ISNULL([Column]) ? "" : REPLACE(REPLACE(REPLACE(REPLACE([Column],CHAR(10),""),CHAR(13),""),CHAR(09),""),CHAR(34),CHAR(39))
    

    Also you can replace Function 1 with a Lookup Transformation in SSIS package or with a LEFT JOIN in SQL query.