Search code examples
sqlsql-serverpatindex

MS SQL PATINDEX and Regex


Been banging my head at this for a few trying to come up with a clever way to regex (or if someone has a better idea) a size value from a very random string. I've setup the below test to show what I'm working with, I'm trying to get the Final output in this statement to be like these in Green, and not these in Red by using the @pattern in the query to pull out the...

Number x Number

Hoping there's a regex or string guru out there that might have an idea for me :)

enter image description here

DECLARE @pattern AS VARCHAR(100)
SET @pattern = '%[0-9][0-9. x][0-9.x ][0-9. x]%'

BEGIN

    WITH cte AS (
        SELECT 'Italy Terrazzo Sacra Nero 24x24 Honed' [Name]
        UNION
        SELECT 'Nero Marquina 1x3 Herringbone' [Name]
        UNION
        SELECT 'Myorka Blue 2x8' [Name]
        UNION
        SELECT 'Chrysler Driftwood Hickory Sea 12mil Wear Layer Glue Down 6.3x48.4' [Name]
        UNION
        SELECT 'Myorka Blue 2x8' [Name]
        UNION
        SELECT 'Nero Marquina 1x3 Herringbone' [Name]
        UNION
        SELECT 'Broadway Lvt Ash 9x72 Rigid Core Click- 6.0mm/28mil Wear Layer' [Name]
        UNION
        SELECT 'Minetta Concreto Pearl 2.5mm/28mil Wear Layer Glue Down 18x36' [Name]
        UNION
        SELECT 'Speak Leather Black 24 x 24' [Name]
        UNION
        SELECT 'Accent Montana White 12 x 36 Glossy' [Name]
    )


    SELECT 
         --FULL NAME--
          [Name]
         
         --HELPERS TO SEE WHAT I'M TESTING--
         ,PATINDEX('%[0-9]%x%', [Name]) AS [START]
         ,SUBSTRING([Name], PATINDEX('%[0-9]%x%', [Name]), LEN([Name]))
         ,REVERSE([Name]) AS [REVERSE]
         ,LEN([Name]) AS [STRLenght]
         ,PATINDEX(@pattern, REVERSE([Name])) + 1
         ,LEN([Name]) - PATINDEX('%[0-9][^A-z]x%', REVERSE([Name])) + 1 AS [END]

         --FULL CALCULATION FOR FINAL OUTPUT--
         ,CASE WHEN [Name] LIKE '%[0-9] x [0-9]%'
               THEN SUBSTRING([Name], PATINDEX('%[0-9]%x%', [Name]), (LEN([Name]) - PATINDEX('%[0-9]%x%', REVERSE([Name])) + 1) - PATINDEX('%[0-9]%x%', [Name]) + 1)
               WHEN [Name] LIKE '%[0-9]x[0-9]%'
               THEN REPLACE(SUBSTRING([Name], PATINDEX(@pattern, [Name]), (LEN([Name]) - PATINDEX(@pattern, REVERSE([Name])) + 1) - PATINDEX(@pattern, [Name]) + 1), 'x', ' x ')
               ELSE NULL
            END AS [Final]
          FROM cte
END

--EDIT--

I'm still trying to make sense of the query from Yitzhak has suggested. There are still a few outliers I'm trying to attack by changing your

    ,c.query('
        for $x in /root/r[lower-case(text()[1])="x"]
        let $pos := count(root/r[. << $x]) + 1
        let $before := /root/r[$pos - 1]
           ,$twobefore := /root/r[$pos - 2]
           ,$after := /root/r[$pos + 1]

        return
            if (xs:decimal($before[1]) instance of xs:decimal and
                xs:decimal($after[1]) instance of xs:decimal and
                xs:string($twobefore[1]) instance of xs:string)
                    then data(($before, $x, $after))
            else if (xs:decimal($before[1]) instance of xs:decimal and
                     xs:decimal($after[1]) instance of xs:decimal and
                     xs:decimal($twobefore[1]) instance of xs:decimal)
                        then data(($twobefore, $before, $x, $after))
            else()
    ').value('text()[1]', 'VARCHAR(20)') AS result

This was in an attempt to attack this outlier case but doesn't seem to be getting the $twobefore which what I thought would get the 1 instead of missing it.

Baroque Crackled 1 3/4 X 6 Chair Rail Blanco


Solution

  • Please try the following solution based on tokenization.

    It is leveraging SQL Server built-in XML and XQuery functionality.

    Notable points:

    • CROSS APPLY is tokenizing input string as XML.
    • XQuery FLWOR expression traversing XML and filtering out any token that is not 'x' via the XPath predicate [lower-case(text()[1])="x"].
    • /root/r[$pos - 1] and /root/r[$pos + 1] XPath predicates get preceding and following tokens.
    • where clause is checking for a decimal data type for at least one of the surrounding values.

    SQL

    -- DDL and sample data population, start
    DECLARE @tbl TABLE (id INT IDENTITY PRIMARY KEY, tokens VARCHAR(256));
    INSERT @tbl (tokens) VALUES
    ('Italy Terrazzo Sacra Nero 24x24 Honed'),
    ('Nero Marquina 1x3 Herringbone'),
    ('Myorka Blue 2x8'),
    ('Chrysler Driftwood Hickory Sea 12mil Wear Layer Glue Down 6.3x48.4'),
    ('Nero Marquina 1x3 Herringbone'),
    ('Broadway Lvt Ash 9x72 Rigid Core Click- 6.0mm/28mil Wear Layer'),
    ('Minetta Concreto Pearl 2.5mm/28mil Wear Layer Glue Down 18x36'),
    ('Speak Leather Black 24 x 24'),
    ('Accent Montana White 12 x 36 Glossy'),
    ('Baroque Crackled 1/2 X 6 Pencil Capri'),
    ('Banza Green Cielo 8" Hex'),
    ('Masia Torello Nero Brillo 3/4x12 Pencil'),
    ('Baroque Crackled 1 3/4 X 6 Chair Rail Blanco');
    -- DDL and sample data population, end
    
    DECLARE @separator CHAR(1) = SPACE(1);
    
    SELECT t.*
        , c.query('
            for $x in /root/r[lower-case(text()[1])="x"]
            let $pos := count(root/r[. << $x]) + 1
            let $before2 := /root/r[$pos - 2],
                $before := /root/r[$pos - 1],
                $after := /root/r[$pos + 1]
            where xs:decimal($before[1]) instance of xs:decimal 
                or xs:decimal($after[1]) instance of xs:decimal 
            return data((if (xs:decimal($before2[1]) instance of xs:decimal) then $before2 else ()
                , $before, $x, $after))
        ').value('text()[1]', 'VARCHAR(20)') AS result
    FROM @tbl AS t
        CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
            REPLACE(REPLACE(REPLACE(tokens,'x', ' x '),SPACE(2),SPACE(1)), @separator, ']]></r><r><![CDATA[') + 
            ']]></r></root>' AS XML)) AS t1(c)
    ORDER BY id;
    

    Output

    id tokens result
    1 Italy Terrazzo Sacra Nero 24x24 Honed 24 x 24
    2 Nero Marquina 1x3 Herringbone 1 x 3
    3 Myorka Blue 2x8 2 x 8
    4 Chrysler Driftwood Hickory Sea 12mil Wear Layer Glue Down 6.3x48.4 6.3 x 48.4
    5 Nero Marquina 1x3 Herringbone 1 x 3
    6 Broadway Lvt Ash 9x72 Rigid Core Click- 6.0mm/28mil Wear Layer 9 x 72
    7 Minetta Concreto Pearl 2.5mm/28mil Wear Layer Glue Down 18x36 18 x 36
    8 Speak Leather Black 24 x 24 24 x 24
    9 Accent Montana White 12 x 36 Glossy 12 x 36
    10 Baroque Crackled 1/2 X 6 Pencil Capri 1/2 x 6
    11 Banza Green Cielo 8" Hex NULL
    12 Masia Torello Nero Brillo 3/4x12 Pencil 3/4 x 12
    13 Baroque Crackled 1 3/4 X 6 Chair Rail Blanco 1 3/4 x 6