Search code examples
azureazure-sql-serverazure-synapse

I want to extract a XML value for a matching tag stored as varchar(max) in MS Azure Synapse


I am new Azure was using SAS before now we are moving to azure synapse In current environment I want to extract a XML tag value stored in column C (varcharmax) as variable. [Dataset][1] [1]: https://i.sstatic.net/tbSIF.png Below XML is saved in the column C (PKDATA)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:DataSet xmlns:ns2="http://www.test.com/t/cn/el">
    <EnumObject>
        <name>Inpatient</name>
        <value>262784067</value>
        <radiobutton>false</radiobutton>
    </EnumObject>
    <StringObject>
        <name>xxx</name>
        <prompt></prompt>
        <value>/widget.jsp</value>
        <width>99</width>
    </StringObject>
</ns2:DataSet>

If name is Inpatient then 262784067 as Inpatient type

output

| A | B | Inpatient type | | 11212 | 2587140 | 262784067 |

I used the following code select a,b, pkdata.value('/EnumObject/name') as Inpatient type from dbo.extdata

i get the following error Cannot find either column "pkkddata" or the user-defined function or aggregate "pkdata.value", or the name is ambiguous.

I tried using following query but gives me the error Msg 104220, Level 16, State 1, Line 26 Cannot find data type 'xml'. SELECT a,b,(pkdata).value('(/EnumObject/name/text())[1]', 'varchar(100)') FROM [dbo].extdata CROSS APPLY (SELECT CAST(pkdata AS xml)) AS x(pkdata)

i get the following error when I use the below code The XMLDT method 'nodes' can only be invoked on columns of type xml. i tried to use the following but get incorrect syntax near passing select x.* from [dbo].[EXTDATA] rt cross join xmltable( '/EnumObject/name' passing xmltype(rt.pkdata) columns name number path 'name/@value' ) x

Not sure how to proceed

Azure SQL version Microsoft Azure SQL Data Warehouse - 10.0.16003.0 Apr 28 2021 04:55:16 Copyright (c) Microsoft Corporation


Solution

  • Azure Synapse Analytics, specifically dedicated SQL pools does not support the XML datatype or any of the functions accompanying it including FOR XML, .nodes, .value, .query, .modify etc.

    If you need this type of processing, you can either use traditional SQL Server, eg SQL Server 2019 or Azure SQL DB. One option would be to use Synapse Pipelines to move data there. As an alternative you could look at using Synapse Notebooks and some custom Python / Scala / c# code, but I have only done a simple test on this.

    Simple example in Scala:

    Cell 1

    // Get the table with the XML column from the database and expose as temp view
    val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
    
    df.createOrReplaceTempView("someXMLTable")
    

    Cell 2

    %%sql
    -- Use SparkSQL to interrogate the XML
    -- https://spark.apache.org/docs/2.3.0/api/sql/index.html#xpath
    SELECT
        colA,
        colB,
        xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
    FROM someXMLTable
    

    Cell 3

    val df2 = spark.sql(""" 
    SELECT
        colA,
        colB,
        xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
    FROM someXMLTable
    """)
    
    df2.show
    

    Cell 4

    // Write that dataframe back to the dedicated SQL pool
    df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
    

    Screenprint from sample notebook: Synapse Notebook shredding XML

    XML is kind of old-fashioned these days - have you thought about switching to JSON? Also if your data volumes aren't that big it would be a lot cheaper to just Azure SQL DB rather than Synapse.