Search code examples
sql-serverxml-parsingcdata

Extract XML nodes in CDATA with missing tags


I have XML that contains html tags that are not closed.So I embedded CDATA inside it so it doesn't error out. How can I extract different XML nodes.

CREATE tABLE dbo.temp(ID int, input varchar(max))

INSERT into dbo.temp(1,'<?xml version="1.0" encoding="iso-8859-1"?>  <!DOCTYPE title [ <!ELEMENT title ANY > <!ENTITY xxe SYSTEM "https://grepular.com/xxe.txt" >]>
    <customer>
    <![CDATA[<TransmissionId>5555</TransmissionId>
    <HeadLine>Hair Loss &amp; Growth Treatments and Products Sales Market Research Report 2016-2021</p></HeadLine>
    ]]></customer>')

As we see there is a </p> without a matching

in the Headline node. How can I extract nodes from this

Thanks MR


Solution

  • You can use the string functions like left,right, and charindex, but I find that tedious. If you are open to a TVF which will extract values based on a supplied pattern. I should add, this is not limited to XML tags.

    Being a TVF one or many values will be returned.

    Example

    Declare @YourTable table (ID int,input varchar(max))
    Insert Into @YourTable Values
    (1,'<?xml version="1.0" encoding="iso-8859-1"?>  <!DOCTYPE title [ <!ELEMENT title ANY > <!ENTITY xxe SYSTEM "https://grepular.com/xxe.txt" >]>
        <customer>
        <![CDATA[<TransmissionId>5555</TransmissionId>
        <HeadLine>Hair Loss &amp; Growth Treatments and Products Sales Market Research Report 2016-2021</p></HeadLine>
        ]]></customer>')
    
    
    Select A.ID
          ,B.RetVal
     From  @YourTable A
     Cross Apply [dbo].[udf-Str-Extract](A.Input,'<HeadLine>','</HeadLine>') B
    

    Returns

    ID  RetVal
    1   Hair Loss &amp; Growth Treatments and Products Sales Market Research Report 2016-2021</p>
    

    The UDF if Interested

    CREATE FUNCTION [dbo].[udf-Str-Extract] (@String varchar(max),@Delimiter1 varchar(100),@Delimiter2 varchar(100))
    Returns Table 
    As
    Return (  
    
    with   cte1(N)   As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
           cte2(N)   As (Select Top (IsNull(DataLength(@String),0)) Row_Number() over (Order By (Select NULL)) From (Select N=1 From cte1 N1,cte1 N2,cte1 N3,cte1 N4,cte1 N5,cte1 N6) A ),
           cte3(N)   As (Select 1 Union All Select t.N+DataLength(@Delimiter1) From cte2 t Where Substring(@String,t.N,DataLength(@Delimiter1)) = @Delimiter1),
           cte4(N,L) As (Select S.N,IsNull(NullIf(CharIndex(@Delimiter1,@String,s.N),0)-S.N,8000) From cte3 S)
    
    Select RetSeq = Row_Number() over (Order By N)
          ,RetPos = N
          ,RetLen = charindex(@Delimiter2,RetVal)-1
          ,RetVal = left(RetVal,charindex(@Delimiter2,RetVal)-1)
     From (Select A.N,RetVal = ltrim(rtrim(Substring(@String, A.N, A.L))) From cte4 A ) A
     Where charindex(@Delimiter2,RetVal)>1
    )
    /*
    Max Length of String 1MM characters
    
    Declare @String varchar(max) = 'Dear [[FirstName]] [[LastName]], ...'
    Select * From [dbo].[udf-Str-Extract] (@String,'[[',']]')
    */