Search code examples
sql-serverxmlt-sql

How to get the superset of all XML paths from an XML column in a table


I have a table with an XML column. I have no data dictionary for the column or definition for the XML structure. There are about 1.5 million records. From visual inspection of a small sample, the structure of the XML appears to be fairly similar. What I'd like to be able to do is to run some SQL that will give me the superset of XML structure across all the records.

The table is called Remittance and the column is called RemittInstr.

For example, if I have two test data records whose XML value in the RemittInstrcolumn is:

row 1:

<ri>
    <Msg Type="MT103">
        <AccountNo>12345678</AccountNo>
        <Description code="ORCSR">Ordering Customer<Description>
    </Msg>
</ri>

row 2:

<ri>
    <Msg Type="MT202">
        <BICFI>ABCD1234</BICFI>
        <Description code="FI">Financial Institution<Description>
    </Msg>
</ri>

How do I write a query that will return the following rows:

/ri
/ri/Msg
/ri/Msg/@Type
/ri/Msg/AccountNo
/ri/Msg/BICFI
/ri/Msg/Description
/ri/Msg/Description/@code

So that I can get a full picture of the structure of the XML across all the rows?

Edit: This is sufficiently different to the linked questions as the linked questions only deal with a single XML value. The question is about finding the structure of the XML across the rows in an entire table. The solution to doing this is sufficiently different. @Charlieface has provided a link to a good solution in his comments below.


Solution

  • You can use a recursive CTE for this. I would warn you that this is likely to be very slow on 1.5m rows.

    It's a bit complicated to get attribute names, because it doesn't seem you can do .nodes('@*') to get all attributes.

    Instead you need to CROSS APPLY a union of the node name and its atttributes.

    WITH cte AS (
        SELECT
          xpath = r.RemittanceInstr.value('local-name(.)','nvarchar(max)'),
          child = r.RemittanceInstr
        FROM Remittance r
    
        UNION ALL
    
        SELECT
          xpath = CONCAT(cte.xpath, '/', v2.name),
          v2.child
        FROM cte
        CROSS APPLY cte.child.nodes('*') x(nd)
        CROSS APPLY (VALUES (x.nd.value('local-name(.)','nvarchar(max)') )) v(name)
        CROSS APPLY (
            SELECT
              v.name,
              x.nd.query('*')
    
            UNION ALL
    
            SELECT
              CONCAT(v.name, '/@', x2.attr.value('local-name(.)','nvarchar(max)')),
              NULL
            FROM x.nd.nodes('@*') x2(attr)
        ) v2(name, child)
    )
    SELECT DISTINCT
      xpath
    FROM cte;
    

    db<>fiddle