How to extract HTML elements from inside the "content:encoded" part of an RSS feed?

I am trying to generate a newsletter which, among other things, includes news entries that also appear on the website. The website is built with WordPress and has an RSS feed, which is not actively used but now comes in handy for parsing the news entries.

I am writing a simple generator script in Bash using xmlstarlet. In particular I am able to get the title, the description and the URL for the news entries (I iterate over them using $itemnum as index):

TITLE=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "title" feed.xml);
DESC=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "description" feed.xml);
URL=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "link" feed.xml);

But now I also want to get the thumbnail URL and the date of each news entry. Those are really two separate questions, so I am only asking about the thumbnail URL here (the date is easy to get from <pubDate>...</pubDate>, but it is not localized). The URL sits inside the <content:encoded>...</content:encoded> element, which contains a lot of different HTML tags.

I know that xmlstarlet has an HTML option, but I don't know how to use it when the HTML is embedded inside an XML element. If I try to parse the output of

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml | xmlstarlet sel -t -c "//img[@class='size-medium wp-image-2821 alignright'][1]"

it gives errors:

-:1.1: Start tag expected, '<' not found
&lt;div&gt;
^

The reason might be that when getting

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml

it translates all tag brackets < and > into &lt; and &gt; and I don't know how to work around it.

edit:

Here is what a news entry looks like:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    
    xmlns:georss="http://www.georss.org/georss"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    >

<channel>
    <title>This is the title</title>
    <atom:link href="https://link.to/feed" rel="self" type="application/rss+xml" />
    <link>https://website.url</link>
    <description>This is the description</description>
    <lastBuildDate>Wed, 20 Dec 2023 04:49:30 +0000</lastBuildDate>
    <language>de-DE</language>
    <sy:updatePeriod>
    hourly  </sy:updatePeriod>
    <sy:updateFrequency>
    1   </sy:updateFrequency>
    <generator>https://wordpress.org/?v=6.3.2</generator>
<site xmlns="com-wordpress:feed-additions:1">124249965</site>   

<item>
        <title>A title</title>
        <link>https://link.to/the-news-entry</link>
        
        <dc:creator><![CDATA[HP-Admin]]></dc:creator>
        <pubDate>Wed, 20 Dec 2023 04:49:30 +0000</pubDate>
                <category><![CDATA[Uncategorized]]></category>
        <guid isPermaLink="false">https://perma.link/p123</guid>

                    <description><![CDATA[a short description]]></description>
                                        <content:encoded><![CDATA[<p>A paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p><img decoding="async" fetchpriority="high" class="size-medium wp-image-2821 alignright" src="https://link.to/first-image.jpg" alt="" width="200" height="300" srcset="https://link.to/first-image.jpg 200w, https://link.to/first-image.jpg 683w, https://link.to/first-image.jpg 768w, https://link.to/first-image.jpg 1024w, https://link.to/first-image.jpg 1365w, https://link.to/first-image.jpg 367w, https://link.to/first-image.jpg 16w, https://link.to/first-image.jpg 24w, https://link.to/first-image.jpg 32w, https://link.to/first-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p><img decoding="async" class="size-medium wp-image-2820 alignright" src="https://link.to/second-image.jpg" alt="" width="200" height="300" srcset="https://link.to/second-image.jpg 200w, https://link.to/second-image.jpg 683w, https://link.to/second-image.jpg 768w, https://link.to/second-image.jpg 1024w, https://link.to/second-image.jpg 1365w, https://link.to/second-image.jpg 367w, https://link.to/second-image.jpg 16w, https://link.to/second-image.jpg 24w, https://link.to/second-image.jpg 32w, https://link.to/second-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
                    
        
        
        <post-id xmlns="com-wordpress:feed-additions:1">2818</post-id>  </item>
    </channel>
</rss>

I have now noticed that neither of the two images is actually the header image I need. The URL of the header image does not appear in the feed XML at all, and I am puzzled as to why.

Solution

To extract the simple variables, for example:

# shellcheck shell=sh disable=SC2016

xmlstarlet select -T -t \
  --var idx -o "${itemnum:-1}" -b \
  --var q1 -o "'" -b \
  -m '/rss/channel/item[$idx]' \
    -v 'concat("title=",$q1,str:replace(title,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
    -v 'concat("desc=",$q1,str:replace(description,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
    -v 'concat("url=",$q1,link,$q1)' -n \
    -v 'concat("pubdate=",$q1,pubDate,$q1)' -n \
feed.xml

where

xmlstarlet select's -T (aka --text) option produces plain-text output
--var defines a named variable; see xmlstarlet.txt for examples
the itemnum shell variable is passed in using shell parameter expansion (defaulting to 1 if it is unset or empty)
any single quote characters embedded in the title and description elements are escaped for the shell (see the title line in the output) using the EXSLT str:replace function

Output:

title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
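
Because each output line has the form name='value', the lines can be loaded into shell variables with eval. The following is only a sketch of that idea: the here-document stands in for the xmlstarlet command, and the variable name assignments is made up for illustration. Note that only title and description are escaped by the command above, so this assumes that link and pubDate in the feed never contain a single quote.

```shell
# Sample assignments copied from the output above; in a real script they
# would come from the xmlstarlet command instead of a here-document.
assignments=$(cat <<'EOF'
title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
EOF
)

# eval turns each name='value' line into a shell variable assignment.
# Assumption: the feed is trusted, since eval executes whatever it is given.
eval "$assignments"

printf 'Title: %s\n' "$title"
printf 'Link:  %s\n' "$url"
```

The '\'' sequence closes the quoted string, adds a literal single quote, and reopens the string, which is why the title survives the round trip unchanged.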

With GNU date, the pubdate value can be converted to local time, e.g. with date -Isec -d "${pubdate}".
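
As a rough sketch of the date conversion, assuming GNU date is available: the helper name localize_pubdate, its argument order, and the Europe/Berlin zone are illustrative choices, not part of the original answer. The named zone only takes effect if the system's time zone database is installed.

```shell
# Hypothetical helper: reformat an RSS pubDate with GNU date.
#   $1 = pubDate text as found in the feed
#   $2 = time zone name (defaults to UTC)
#   $3 = strftime format without the leading "+" (defaults to an ISO-like form)
localize_pubdate() {
  TZ="${2:-UTC}" date -d "$1" "+${3:-%Y-%m-%d %H:%M:%S %z}"
}

pubdate='Wed, 20 Dec 2023 04:49:30 +0000'

# Default format in UTC
localize_pubdate "$pubdate"

# Day.month.year in an assumed local zone
localize_pubdate "$pubdate" Europe/Berlin '%d.%m.%Y %H:%M'
```

Localized month and weekday names (for example via %B and %A) additionally depend on the LC_TIME locale being installed on the system, so they are left out of this sketch.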

To extract the image URLs from the embedded HTML, for example:

# shellcheck shell=sh disable=SC2016

xmlstarlet select -T -t \
  --var idx -o "${itemnum:-1}" -b \
  -v '/rss/channel/item[$idx]/content:encoded' \
feed.xml |
xmlstarlet format -R -H -D |
# tee /dev/stderr |
xmlstarlet select -T -t \
  --var cls -o "${class:-wp-image-2821}" -b \
  --var q1 -o "'" -b \
   -m 'str:split(//img[contains(@class,$cls)]/@srcset,",")' \
     --var url_sz='str:split(.," ")' \
     -v 'concat("url_",$url_sz[2],"=",$q1,$url_sz[1],"?width=",substring-before($url_sz[2],"w"),$q1)' -n

use xmlstarlet select's -T (aka --text) option for plain-text output
the shell variables itemnum and class are passed in using shell parameter expansion
use xmlstarlet format --recover --html --drop-dtd to convert HTML 4.0 to XML; note that HTML entities such as &nbsp; are converted (uncomment the tee line to have a look)
use xmlstarlet select and the XPath contains function to extract the appropriate img/@srcset text, str:split it, first by comma and then by space, and concat the substrings into a useful format

Output:

url_200w='https://link.to/first-image.jpg?width=200'
url_683w='https://link.to/first-image.jpg?width=683'
url_768w='https://link.to/first-image.jpg?width=768'
url_1024w='https://link.to/first-image.jpg?width=1024'
url_1365w='https://link.to/first-image.jpg?width=1365'
url_367w='https://link.to/first-image.jpg?width=367'
url_16w='https://link.to/first-image.jpg?width=16'
url_24w='https://link.to/first-image.jpg?width=24'
url_32w='https://link.to/first-image.jpg?width=32'
url_1707w='https://link.to/first-image.jpg?width=1707'
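
The split-by-comma-then-by-space step can also be illustrated in plain shell, which may help when checking what the XPath expression is doing. This is a minimal sketch, not the answer's method: the function name split_srcset is hypothetical, and it assumes that the URLs in srcset contain neither commas nor spaces.

```shell
# Hypothetical helper: turn a srcset value into url_<size>='<url>?width=<n>' lines.
#   $1 = the raw srcset attribute text
split_srcset() {
  printf '%s\n' "$1" |
  tr ',' '\n' |                      # one "URL SIZE" candidate per line
  while read -r url size; do         # read strips the leading space after each comma
    [ -n "$url" ] || continue        # skip empty candidates
    printf "url_%s='%s?width=%s'\n" "$size" "$url" "${size%w}"
  done
}

split_srcset 'https://link.to/first-image.jpg 200w, https://link.to/first-image.jpg 683w'
```

Here ${size%w} strips the trailing w from the width descriptor, mirroring substring-before($url_sz[2],"w") in the xmlstarlet command.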