xslt, xml-parsing, xmllint

Extract img src from CDATA text in XML


I would like to extract the img src value from an XML file.

Test input:

<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p align="left" dir="ltr">
    <span lang="EN">lorem ipsum</span></p>
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
    </ITEM>
</ROOT>         

What would be the best way to do it? With XSLT or an XML parser, like xmllint?

Currently I am trying with xmllint:

xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'

...but output is like:

src="https://example.com/hello.jpg

Sure, I could strip the leading src=" with a tool like sed, but is there a better, cleaner way to extract the links?


Solution

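  • To the XML parser, the content of a CDATA section is plain text, so the markup inside it has to be parsed a second time. With XPath 3 or XSLT 3 you can do that with parse-xml-fragment: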

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="#all"
        version="3.0">    
    
      <xsl:output method="text"/>
    
      <xsl:template match="/">
         <xsl:value-of select="ROOT/ITEM/DESCRIPTION/parse-xml-fragment(.)//img/@src"/>
      </xsl:template>
    
    </xsl:stylesheet>
    

    https://xsltfiddle.liberty-development.net/3NSSEv7

    Saxon 9.9 HE supports XSLT 3 and is available for .NET, Java, and C/C++/Python.

    If the CDATA contains HTML that is not well-formed X(HT)ML, you could instead use the HTML parser that David Carlisle implemented in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:html-parser="data:,dpc"
        exclude-result-prefixes="#all"
        version="3.0">
    
      <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
    
      <xsl:output method="text"/>
    
      <xsl:template match="/">
         <xsl:value-of select="ROOT/ITEM/DESCRIPTION/html-parser:htmlparse(., '', true())//img/@src"/>
      </xsl:template>
    
    </xsl:stylesheet>
    

    https://xsltfiddle.liberty-development.net/3NSSEv7/1
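
The same two-stage idea (parse the XML first, then parse the CDATA text as markup) can also be sketched outside XSLT. The following is only an illustration using the Python standard library, not part of the XSLT solutions above; the names ImgSrcCollector, extract_img_sources and SAMPLE are made up for this example. It assumes the ROOT/ITEM/DESCRIPTION layout from the question and uses html.parser, which tolerates HTML that is not well-formed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling this.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(xml_text):
    """Return the img src values found in each ROOT/ITEM/DESCRIPTION."""
    root = ET.fromstring(xml_text)
    sources = []
    for description in root.iterfind("ITEM/DESCRIPTION"):
        # Stage 1: the XML parser returns the CDATA content as plain text.
        # Stage 2: that text is parsed again, this time as HTML.
        collector = ImgSrcCollector()
        collector.feed(description.text or "")
        collector.close()
        sources.extend(collector.sources)
    return sources


# Sample input copied from the question.
SAMPLE = """<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p align="left" dir="ltr">
    <span lang="EN">lorem ipsum</span></p>
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
    </ITEM>
</ROOT>"""

if __name__ == "__main__":
    for src in extract_img_sources(SAMPLE):
        print(src)  # prints https://example.com/hello.jpg
```

Because the attribute value is read from a parsed tag rather than matched with a regular expression, there is no src=" prefix to strip afterwards.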