Search code examples
sqlsql-server-2014text-extraction

What is the most efficient way of extracting these integers from a string using SQL?


I have a table (in a SQL server 2014 database, accessed via SSMS) of engineering data, and I need to report the bolt lengths that are mentioned in a description field, marked in bold:

STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)
STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)
STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)
STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)

Note that there are two bolt lengths mentioned in the description (eg the first record has "140 LG" and "170 LG"), but I only need the second one, "170", which is also always the larger of the two. While the position of the number should remain a fixed length from the end of the string, the length varies between two and three digits.

So beyond using MID(string,15,3) inside a lot of nested if statements to determine the length and position of the number, I don't really know how best to go about doing this.

Any suggestions would be much appreciated.


Solution

  • Please try the following solution that is using tokenization via XML and XQuery.

    The desired value is not in the same position from the end of the string due to edge cases.

    That's why, first, we will find a position of the bolt length using if/else logic.

    After that the XPath predicate /root/r[last() - sql:column("t2.position")] does the job.

    SQL

    -- DDL and sample data population, start
    DECLARE @tbl TABLE (id INT IDENTITY PRIMARY KEY, tokens NVARCHAR(1024));
    INSERT INTO @tbl (tokens) VALUES
    ('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)'),
    ('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)'),
    ('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)'),
    ('STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)'),
    ('STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H (12 x 7/8" x 110 LG C/W NUT)')
    -- DDL and sample data population, end
    
    DECLARE @separator CHAR(1) = SPACE(1);
    
    SELECT *
        , bolt_length = c.value('(/root/r[last() - sql:column("t2.position")]/text())[1]', 'INT')
    FROM @tbl AS t
    CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
        REPLACE(tokens, @separator, ']]></r><r><![CDATA[') + 
        ']]></r></root>' AS XML)) AS t1(c)
    CROSS APPLY (SELECT c.query('if (/root/r[last() - 1]/text() = "1") then 4
        else 3').value('.','INT')) AS t2(position);
    

    Output

    id tokens bolt_length
    1 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT) 170
    2 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT) 90
    3 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT) 150
    4 STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT) 270
    5 STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H (12 x 7/8" x 110 LG C/W NUT) 110