Extract img src from CDATA text in XML

I would like to extract the img src value from an XML file.

Test input:

<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p align="left" dir="ltr">
    <span lang="EN">lorem ipsum</span></p>
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
    </ITEM>
</ROOT>

What would be the best way to do it? With XSLT or an XML parser, like xmllint?

Currently I am trying with xmllint:

xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'

...but output is like:

src="https://example.com/hello.jpg

Sure, I can strip the leading src=" with a tool like sed, but is there a better and cleaner way to extract the links?

Solution

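Because the markup sits inside a CDATA section, the XML parser sees it as plain text, not as elements, so an XPath 1.0 tool like xmllint cannot select the img element directly. With XPath 3 or XSLT 3 you can parse that text with parse-xml-fragment() and then select img/@src: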

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">    

  <xsl:output method="text"/>

  <xsl:template match="/">
     <xsl:value-of select="ROOT/ITEM/DESCRIPTION/parse-xml-fragment(.)//img/@src"/>
  </xsl:template>

</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/3NSSEv7

To run XSLT 3 you can use Saxon 9.9 HE, which is available for .NET, Java, and C/C++/Python.

If the CDATA section contains HTML that is not well-formed X(HT)ML (for example an unclosed img tag), parse-xml-fragment() fails with an error. In that case you could use the HTML parser implemented by David Carlisle in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:html-parser="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:output method="text"/>

  <xsl:template match="/">
     <xsl:value-of select="ROOT/ITEM/DESCRIPTION/html-parser:htmlparse(., '', true())//img/@src"/>
  </xsl:template>

</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/3NSSEv7/1
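
If you would rather stay with the xmllint pipeline from the question, the following is a minimal sketch of a tighter text-based filter. The function name extract_img_src is hypothetical, and it assumes a grep that supports -o (as the egrep call in the question already does). It is still plain pattern matching, not HTML parsing, so it only catches double-quoted src attributes of img tags that do not span several lines.

```shell
# Hypothetical helper: read markup on stdin, print one img src value per line.
extract_img_src() {
  # 1. grep -o prints only the matched part: from "<img" up to the closing
  #    quote of its src attribute. [^>]* keeps the match inside a single tag,
  #    and [^"]* stops at the first closing quote instead of running greedily.
  # 2. sed removes everything up to and including src=" and the trailing quote.
  grep -o '<img[^>]*src="[^"]*"' | sed 's/.*src="//; s/"$//'
}

# Demo on a fragment shaped like the CDATA content from the question.
extract_img_src <<'EOF'
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
EOF
# prints: https://example.com/hello.jpg
```

With the function defined, the command from the question would become: xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | extract_img_src. Unlike the original pattern, this does not depend on the file extension, so it also returns .gif or extensionless URLs.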