sql-server, azure-sql-database, data-warehouse

Does Azure SQL Data Warehouse have a way to split strings?


Doing some research, I see that there are no good options to split strings in Azure SQL Data Warehouse. It doesn't have the new STRING_SPLIT() function or the OPENJSON() function. It also doesn't allow SELECT statements in user-defined functions, so you can't roll your own splitter the way many of the community's custom splitter functions do.

Thus, I figured I would pose the questions: does SQL Data Warehouse have ways to split strings, and what are the best options to take here?

Use Case

You have a field in a SQL table with the value "My_Value_Is_Good". The objective is to split each segment into a separate field on the underscore delimiter, either in a SELECT statement or, at most, written to a new table.
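
For that example value, the desired result would look something like this (illustrative column names):

    Part1 | Part2 | Part3 | Part4
    ------+-------+-------+-------
    My    | Value | Is    | Good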

Solutions I've Used

The main one for me is transforming the data before it lands in the data warehouse, using Python to parse out the values. However, bigger datasets slow this down, and once the data is already in the system this approach has to be narrowed to specific records.


Solution

  • Update Jul 2019 - STRING_SPLIT is now available in Azure SQL Data Warehouse as per here. So in my example below, the code would be more like this:

    DECLARE @delimiter CHAR(1) = '-';
    
    CREATE TABLE dbo.guids_split
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        HEAP
    )
    AS
    SELECT *
    FROM dbo.guids g
        CROSS APPLY STRING_SPLIT ( xguid, @delimiter );
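
    Applied to the use case in the question, and assuming STRING_SPLIT is available on your instance as per the update above, a minimal call looks like this (STRING_SPLIT returns a single column named value, one row per segment):

    SELECT value
    FROM STRING_SPLIT('My_Value_Is_Good', '_');
    -- returns four rows: My, Value, Is, Good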
    

    Azure SQL Data Warehouse has a reduced T-SQL surface area as compared with normal SQL Server or Azure SQL Database. At the time of the original answer it did not have any of the fancy tricks such as STRING_SPLIT, table-valued functions, CLR or XML; even cursors are not allowed. In fact, of all the techniques in one of the go-to (pre-SQL 2016) articles on this topic, 'Split strings the right way - or the next best way', the only one you can use is the numbers table.

    Therefore we need something a bit more manual, while still avoiding loops of any kind. I used the above article for inspiration, adapted its test data script, and took this approach:

    -- Create one million guids
    IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
    IF OBJECT_ID('dbo.guids_split') IS NOT NULL DROP TABLE dbo.guids_split
    IF OBJECT_ID('dbo.guids') IS NOT NULL DROP TABLE dbo.guids
    IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
    GO
    
    
    CREATE TABLE dbo.Numbers (
        Number  INT NOT NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,     --!!TODO try distributing?
        CLUSTERED INDEX ( Number )
    )
    GO
    
    
    DECLARE @UpperLimit INT = 1000000;
    
    ;WITH n AS
    (
        SELECT
            x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
        FROM       sys.all_objects AS s1
        CROSS JOIN sys.all_objects AS s2
        CROSS JOIN sys.all_objects AS s3
    )
    SELECT x
    INTO #tmp
    FROM n
    WHERE x BETWEEN 1 AND @UpperLimit
    GO
    
    INSERT INTO dbo.Numbers ( Number )
    SELECT x
    FROM #tmp
    GO
    
    
    CREATE TABLE dbo.guids (
        rn  INT IDENTITY,
        xguid   CHAR(36) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        CLUSTERED COLUMNSTORE INDEX
    )
    GO
    
    INSERT INTO dbo.guids ( xguid )
    SELECT NEWID() xguid
    FROM dbo.Numbers
    GO -- 10    -- use GO 10, GO 100 etc. to repeat the batch and scale up the row count
    
    ALTER INDEX ALL ON dbo.guids REBUILD 
    GO
    
    
    -- Create the stats
    CREATE STATISTICS _st_numbers_number ON dbo.numbers (number);
    CREATE STATISTICS _st_guids_rn ON dbo.guids (rn);
    CREATE STATISTICS _st_guids_xguid ON dbo.guids (xguid);
    GO
    -- multi-col stat?
    
    
    -- NB We know the length of the guid (36 chars), so we don't have to use VARCHAR(MAX)
    DECLARE @delimiter VARCHAR(1) = '-';
    
    CREATE TABLE dbo.guids_split
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        HEAP
    )
    AS
    SELECT
        s.rn,
        n.Number n,
        originalid AS xguid,
        LTRIM( RTRIM( SUBSTRING( s.xguid, n.Number + 1, CHARINDEX( @delimiter, s.xguid, n.Number + 1 ) - n.Number - 1 ) ) ) AS split_value
    FROM (
        SELECT
            rn,
            xguid AS originalid,
            CAST( CAST( @delimiter AS VARCHAR(38) ) + CAST( xguid AS VARCHAR(38) ) + CAST( @delimiter AS VARCHAR(38) ) AS VARCHAR(38) ) AS xguid
            FROM dbo.guids
            ) s
        CROSS JOIN dbo.Numbers n
    WHERE n.Number < LEN( s.xguid )
      AND SUBSTRING( s.xguid, n.Number, 1 ) = @delimiter;
    GO
    
    
    /*
    SELECT TOP 10 * FROM dbo.guids ORDER BY rn;
    
    SELECT *
    FROM dbo.guids_split
    WHERE rn IN ( SELECT TOP 10 rn FROM dbo.guids ORDER BY rn )
    ORDER BY 1, 2;
    GO
    
    */
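
    For a single guid, the rows written to dbo.guids_split look something like this (illustrative values; n is the position of each delimiter within the delimiter-padded string, hence 1, 10, 15, 20, 25 for a standard 8-4-4-4-12 guid):

    -- rn  n   xguid                                 split_value
    -- 1   1   A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  A9D7F3C2
    -- 1   10  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  58B1
    -- 1   15  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  4C2A
    -- 1   20  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  9F7E
    -- 1   25  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  0D4B6C8E1A23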
    

    The script is now tested on ADW and worked satisfactorily over 100 million records. It ran in under 4 minutes at only DWU 400 (at least once I had added the stats and removed the VARCHAR(MAX) :). The guid data is, however, a slightly artificial example, as the values are uniform in size and always split into exactly 5 parts.

    Getting good performance out of Azure SQL Data Warehouse is really to do with minimising data movement via a good hash distribution key. Therefore please post some realistic sample data.
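
    One way to check whether a candidate distribution key avoids that data movement (a sketch; EXPLAIN on a SELECT returns the distributed query plan as XML in SQL Data Warehouse) is to prefix the split query with EXPLAIN and look for shuffle or broadcast move steps in the plan:

    -- Sketch: inspect the distributed plan for the split query above.
    -- Shuffle/broadcast move operations in the returned XML indicate data
    -- movement that a better distribution key might avoid.
    EXPLAIN
    SELECT g.rn, g.xguid, n.Number
    FROM dbo.guids AS g
    CROSS JOIN dbo.Numbers AS n
    WHERE n.Number < LEN(g.xguid);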

    The other alternative is Azure Data Lake Analytics (ADLA). ADLA supports federated queries to "query data where it lives", so you could query the original table using U-SQL, split it using the native .NET method, and output a file which could easily be imported using Polybase. Let me know if you need more help with this approach and I'll do up an example.

    The SQLCAT team have since published an article on anti-patterns with SQL Data Warehouse; this type of string processing might be considered an example of one. Please read it here:

    https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/