sql-server, azure-sql-database, data-warehouse

Does Azure SQL Data Warehouse have a way to split strings?


Doing some research, I see that there are no good options to split strings in Azure SQL Data Warehouse. It doesn't have the new STRING_SPLIT() function or the OPENJSON() function. It also doesn't allow SELECT statements in user-defined functions, so you can't roll your own splitter the way many of the community's custom splitter functions do.

Thus, I figured I would pose the questions: does SQL Data Warehouse have ways to split strings, and what are the best options to take here?

Use Case

You have a field in a SQL table with the value "My_Value_Is_Good". The objective is to split each segment into a separate field on the underscore delimiter, either in a SELECT statement or, at most, written to a new table.
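
For that example value, the desired result would look something like this (illustrative column names):

    Part1 | Part2 | Part3 | Part4
    ------+-------+-------+-------
    My    | Value | Is    | Good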

Solutions I've Used

The main one for me is transforming the data before it lands in the data warehouse, using Python to parse out the values. However, bigger datasets slow this down, and once the data is already in the system this approach has to be narrowed to specific records.


Solution

  • Update Jul 2019 - STRING_SPLIT is now available in Azure SQL Data Warehouse as per here. So in my example below, the code would be more like this:

    DECLARE @delimiter CHAR(1) = '-';
    
    CREATE TABLE dbo.guids_split
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        HEAP
    )
    AS
    SELECT *
    FROM dbo.guids g
        CROSS APPLY STRING_SPLIT ( xguid, @delimiter );
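
    Applied to the use case in the question, and assuming STRING_SPLIT is available on your instance as per the update above, a minimal call looks like this (STRING_SPLIT returns a single column named value, one row per segment):

    SELECT value
    FROM STRING_SPLIT('My_Value_Is_Good', '_');
    -- returns four rows: My, Value, Is, Good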
    

    Azure SQL Data Warehouse has a reduced T-SQL surface area as compared with normal SQL Server or Azure SQL Database. At the time of the original answer it did not have any of the fancy tricks such as STRING_SPLIT, table-valued functions, CLR or XML; even cursors are not allowed. In fact, of all the techniques in one of the go-to (pre-SQL 2016) articles on this topic, 'Split strings the right way - or the next best way', the only one you can use is the numbers table.

    Therefore we need something a bit more manual, while still avoiding loops of any kind. I used the above article for inspiration, adapted its test data script, and took this approach:

    -- Create one million guids
    IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
    IF OBJECT_ID('dbo.guids_split') IS NOT NULL DROP TABLE dbo.guids_split
    IF OBJECT_ID('dbo.guids') IS NOT NULL DROP TABLE dbo.guids
    IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
    GO
    
    
    CREATE TABLE dbo.Numbers (
        Number  INT NOT NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,     --!!TODO try distributing?
        CLUSTERED INDEX ( Number )
    )
    GO
    
    
    DECLARE @UpperLimit INT = 1000000;
    
    ;WITH n AS
    (
        SELECT
            x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
        FROM       sys.all_objects AS s1
        CROSS JOIN sys.all_objects AS s2
        CROSS JOIN sys.all_objects AS s3
    )
    SELECT x
    INTO #tmp
    FROM n
    WHERE x BETWEEN 1 AND @UpperLimit
    GO
    
    INSERT INTO dbo.Numbers ( Number )
    SELECT x
    FROM #tmp
    GO
    
    
    CREATE TABLE dbo.guids (
        rn  INT IDENTITY,
        xguid   CHAR(36) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        CLUSTERED COLUMNSTORE INDEX
    )
    GO
    
    INSERT INTO dbo.guids ( xguid )
    SELECT NEWID() xguid
    FROM dbo.Numbers
    GO -- 10    -- use GO 10, GO 100 etc. to repeat the batch and scale up the row count
    
    ALTER INDEX ALL ON dbo.guids REBUILD 
    GO
    
    
    -- Create the stats
    CREATE STATISTICS _st_numbers_number ON dbo.numbers (number);
    CREATE STATISTICS _st_guids_rn ON dbo.guids (rn);
    CREATE STATISTICS _st_guids_xguid ON dbo.guids (xguid);
    GO
    -- multi-col stat?
    
    
    -- NB We know the length of the guid (36 chars), so we don't have to use VARCHAR(MAX)
    DECLARE @delimiter VARCHAR(1) = '-';
    
    CREATE TABLE dbo.guids_split
    WITH
    (
        DISTRIBUTION = HASH(xguid),
        HEAP
    )
    AS
    SELECT
        s.rn,
        n.Number n,
        originalid AS xguid,
        LTRIM( RTRIM( SUBSTRING( s.xguid, n.Number + 1, CHARINDEX( @delimiter, s.xguid, n.Number + 1 ) - n.Number - 1 ) ) ) AS split_value
    FROM (
        SELECT
            rn,
            xguid AS originalid,
            CAST( CAST( @delimiter AS VARCHAR(38) ) + CAST( xguid AS VARCHAR(38) ) + CAST( @delimiter AS VARCHAR(38) ) AS VARCHAR(38) ) AS xguid
            FROM dbo.guids
            ) s
        CROSS JOIN dbo.Numbers n
    WHERE n.Number < LEN( s.xguid )
      AND SUBSTRING( s.xguid, n.Number, 1 ) = @delimiter;
    GO
    
    
    /*
    SELECT TOP 10 * FROM dbo.guids ORDER BY rn;
    
    SELECT *
    FROM dbo.guids_split
    WHERE rn IN ( SELECT TOP 10 rn FROM dbo.guids ORDER BY rn )
    ORDER BY 1, 2;
    GO
    
    */
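
    For a single guid, the rows written to dbo.guids_split look something like this (illustrative values; n is the position of each delimiter within the delimiter-padded string, hence 1, 10, 15, 20, 25 for a standard 8-4-4-4-12 guid):

    -- rn  n   xguid                                 split_value
    -- 1   1   A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  A9D7F3C2
    -- 1   10  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  58B1
    -- 1   15  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  4C2A
    -- 1   20  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  9F7E
    -- 1   25  A9D7F3C2-58B1-4C2A-9F7E-0D4B6C8E1A23  0D4B6C8E1A23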
    

    The script is now tested on ADW and worked satisfactorily over 100 million records. It ran in under 4 minutes at only DWU 400 (at least once I had added the stats and removed the VARCHAR(MAX) :). The guid data is, however, a slightly artificial example, as the values are uniform in size and always split into exactly 5 parts.

    Getting good performance out of Azure SQL Data Warehouse is really to do with minimising data movement via a good hash distribution key. Therefore please post some realistic sample data.
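
    One way to check whether a candidate distribution key avoids that data movement (a sketch; EXPLAIN on a SELECT returns the distributed query plan as XML in SQL Data Warehouse) is to prefix the split query with EXPLAIN and look for shuffle or broadcast move steps in the plan:

    -- Sketch: inspect the distributed plan for the split query above.
    -- Shuffle/broadcast move operations in the returned XML indicate data
    -- movement that a better distribution key might avoid.
    EXPLAIN
    SELECT g.rn, g.xguid, n.Number
    FROM dbo.guids AS g
    CROSS JOIN dbo.Numbers AS n
    WHERE n.Number < LEN(g.xguid);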

    The other alternative is Azure Data Lake Analytics (ADLA). ADLA supports federated queries to "query data where it lives", so you could query the original table using U-SQL, split it using the native .NET method, and output a file which could easily be imported using Polybase. Let me know if you need more help with this approach and I'll do up an example.

    The SQLCAT team have since published an article on anti-patterns with SQL Data Warehouse; this type of string processing might be considered an example of one. Please read it here:

    https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/