I would like to extract the img src value from an XML file.
Test input:
<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p align="left" dir="ltr">
<span lang="EN">lorem ipsum</span></p>
<p>
some text</p>
<p>
<img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
</ITEM>
</ROOT>
What would be the best way to do it? With XSLT or an XML parser, like xmllint?
Currently I am trying with xmllint:
xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'
...but output is like:
src="https://example.com/hello.jpg
Sure, I can remove src=" with tools like sed, but maybe there is a better and cleaner solution to extract the links?
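The content of DESCRIPTION is a CDATA section, so the XML parser sees it as plain text rather than as elements. To query it you need to dig deeper with XPath 3 or XSLT 3 and use parse-xml-fragment, which parses that text into nodes: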
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="ROOT/ITEM/DESCRIPTION/parse-xml-fragment(.)//img/@src"/>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NSSEv7
Saxon 9.9 HE is available in .NET, Java and C/C++/Python versions to run/use XSLT 3.
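If an XSLT 3 processor is not at hand, the same two-step idea (parse the outer document, then parse the CDATA text as a fragment) can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original answer: the function name img_sources and the wrapper element are made up for illustration, and it assumes the CDATA content is well-formed XML, just as parse-xml-fragment does.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question.
SAMPLE = """<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p align="left" dir="ltr">
<span lang="EN">lorem ipsum</span></p>
<p>
some text</p>
<p>
<img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
</ITEM>
</ROOT>"""


def img_sources(xml_text):
    """Return the src of every img found inside ROOT/ITEM/DESCRIPTION (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    sources = []
    for description in root.iterfind("ITEM/DESCRIPTION"):
        # The CDATA content arrives as plain text. It has several top-level
        # elements, so wrap it in a dummy root before parsing it a second time.
        fragment = ET.fromstring("<wrapper>" + (description.text or "") + "</wrapper>")
        for img in fragment.iter("img"):
            src = img.get("src")
            if src:
                sources.append(src)
    return sources


if __name__ == "__main__":
    print("\n".join(img_sources(SAMPLE)))
```

Like parse-xml-fragment, the second ET.fromstring call raises an error if the embedded markup is not well-formed.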
If the CDATA contains HTML that is not well-formed X(HT)ML then you could use the HTML parser implemented by David Carlisle in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:html-parser="data:,dpc"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="ROOT/ITEM/DESCRIPTION/html-parser:htmlparse(., '', true())//img/@src"/>
</xsl:template>
</xsl:stylesheet>
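For comparison, the lenient-parsing approach can also be sketched outside XSLT with Python's built-in html.parser, which tolerates unclosed tags and uppercase tag names. This is only an illustration under assumptions: the class ImgSrcCollector, the function img_sources_lenient and the MESSY_SAMPLE input are hypothetical and do not come from the original answer or from htmlparse.xsl.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input whose CDATA holds HTML that is NOT well-formed XML.
MESSY_SAMPLE = """<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p>unclosed paragraph<br>
<IMG SRC="https://example.com/a.png">
<img alt="" src="https://example.com/b.jpg">
]]></DESCRIPTION>
</ITEM>
</ROOT>"""


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources_lenient(xml_text):
    """Parse the outer XML strictly, then the CDATA text as tag-soup HTML."""
    root = ET.fromstring(xml_text)
    collector = ImgSrcCollector()
    for description in root.iterfind("ITEM/DESCRIPTION"):
        collector.feed(description.text or "")
    collector.close()
    return collector.sources


if __name__ == "__main__":
    print("\n".join(img_sources_lenient(MESSY_SAMPLE)))
```

The outer document is still parsed as strict XML; only the text inside DESCRIPTION goes through the forgiving HTML parser.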