Search code examples
sql-serverxmlt-sqlsql-server-2019xquery-sql

Extract values from multiple XML columns each with multiple nodes


I have a table myTable with three XML columns of the same structure:

CREATE TABLE myTable 
(
    Field1 XML,
    Field2 XML,
    Field3 XML
);

INSERT INTO myTable (Field1, Field2, Field3) 
VALUES 
    ('<div class="Class1">
          <p>123</p>
          <p>456</p>
      </div>',
     '<div class="Class2">
          <p>abc</p>
          <p>def</p>
          </div>',
     '<div class="Class3">
          <p>XYZ</p>
          <p>AEIOU</p>
      </div>')

I would like to extract the values between the <p> tags, which I am able to accomplish for a single field using nodes(). However, when I try to bring in the other fields, it creates a cross reference of all the values, rather than showing each one's first, second, etc.

This is my code:

SELECT
    a.b.value('.','nvarchar(max)'),
    c.d.value('.','nvarchar(max)'),
    e.f.value('.','nvarchar(max)')
FROM
    myTable
CROSS APPLY 
    myTable.Field1.nodes('div/p') a(b)
CROSS APPLY 
    myTable.Field2.nodes('div/p') c(d)
CROSS APPLY 
    myTable.Field3.nodes('div/p') e(f)

Which produces output like this:

Field1 Field2 Field3
----------------------
123    abc    XYZ
123    abc    AEIOU
123    def    XYZ
123    def    AEIOU
456    abc    XYZ
456    abc    AEIOU
456    def    XYZ
456    def    AEIOU

I would like the output to be:

Field1 Field2 Field3
----------------------
123    abc    XYZ
456    def    AEIOU

The XML columns will always have the same number of nodes for a single row, but the number of nodes can vary from row to row.

Is there any way to accomplish this within a single query?


Solution

  • Annoyingly (as far as I am aware), you can't return the position value from a node, you can only filter to it (such as demonstrated in this answer from Roger Wolf).

    Instead what you could do is create a Tally of for the number of nodes in the first XML column to get a row for each p node for each ID. Then you can filter on the position() in your call to the nodes method and return 1 row per position.

    As you're on 2019, you don't have access to GENERATE_SERIES, so I use a UDF for the tally:

    --Create the UDF
    CREATE FUNCTION [fn].[Tally] (@LastNumber bigint, @Zero bit) 
    RETURNS table
    AS RETURN
    
        WITH N AS(
            SELECT N
            FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
        Tally AS(
            SELECT 0 AS I
            WHERE @Zero = 0
              AND @LastNumber IS NOT NULL
            UNION ALL
            SELECT TOP (ISNULL(@LastNumber,0))
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
            FROM N N1, N N2, N N3, N N4, N N5, N N6, N N7) --Up to 10,000,000 rows
        SELECT I
        FROM Tally T;
    GO
    --Create sample data
    CREATE TABLE dbo.YourTable (YourID int IDENTITY,
                                XML1 xml,
                                XML2 xml,
                                XML3 xml);
    INSERT INTO dbo.YourTable
    VALUES('<div class="Class1">
      <p>123</p>
      <p>456</p>
    </div>','<div class="Class2">
    <p>abc</p>
    <p>def</p>
    </div>','<div class="Class3">
    <p>XYZ</p>
    <p>AEIOU</p>
    </div>');
    GO
    --The actual solution
    WITH NodeCounts AS(
        SELECT YT.YourID,
               COUNT(*) AS Nodes
        FROM dbo.YourTable YT
             CROSS APPLY YT.XML1.nodes('/div/p') div(p)
        GROUP BY YT.YourID)
    SELECT YT.YourID,
           X1.p.value('(./text())[1]','varchar(30)') AS Field1,
           X2.p.value('(./text())[1]','varchar(30)') AS Field2,
           X3.p.value('(./text())[1]','varchar(30)') AS Field3
    FROM dbo.YourTable YT
         JOIN NodeCounts NC ON YT.YourID = NC.YourID
         CROSS APPLY fn.Tally(NC.Nodes,1) T
         CROSS APPLY YT.XML1.nodes('/div/p[position() = sql:column("T.I")]') X1(p)
         CROSS APPLY YT.XML2.nodes('/div/p[position() = sql:column("T.I")]') X2(p)
         CROSS APPLY YT.XML3.nodes('/div/p[position() = sql:column("T.I")]') X3(p);
    

    Alternatively, you could achieve this without a Tally and use ROW_NUMBER in the CTE. This might be more performant (I suggest racing your horses):

    WITH NodeCounts AS(
        SELECT YT.YourID,
               YT.XML1,
               YT.XML2,
               YT.XML3,
               ROW_NUMBER() OVER (PARTITION BY YT.YourID ORDER BY (SELECT NULL)) AS I
        FROM dbo.YourTable YT
             CROSS APPLY YT.XML1.nodes('/div/p') div(p))
    SELECT NC.YourID,
           X1.p.value('(./text())[1]','varchar(30)') AS Field1,
           X2.p.value('(./text())[1]','varchar(30)') AS Field2,
           X3.p.value('(./text())[1]','varchar(30)') AS Field3
    FROM NodeCounts NC
         CROSS APPLY NC.XML1.nodes('/div/p[position() = sql:column("NC.I")]') X1(p)
         CROSS APPLY NC.XML2.nodes('/div/p[position() = sql:column("NC.I")]') X2(p)
         CROSS APPLY NC.XML3.nodes('/div/p[position() = sql:column("NC.I")]') X3(p);
    

    db<>fiddle