I am trying to generate a newsletter which, among other things, includes news entries that are also present on the website. The website is built with WordPress and has an RSS feed, which is not actively used but now comes in handy for parsing the news entries.
I am writing a simple generator script in Bash using xmlstarlet. In particular I am able to get the title, the description and the URL for the news entries (I iterate over them using $itemnum as index):
TITLE=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "title" feed.xml);
DESC=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "description" feed.xml);
URL=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "link" feed.xml);
But now I also want to get the URL for the thumbnail and the date of the news entry. Those are basically two different questions so I only ask about the thumbnail URL (regarding the date: it is easy to get from <pubDate>...</pubDate> but it is not localized). The URL is sitting in the <content:encoded>...</content:encoded> tag, which includes a lot of different HTML tags.
I know that xmlstarlet has an HTML option, but I don't know how to use it when the HTML is embedded inside an XML element. If I try to parse the output of
xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml | xmlstarlet sel -t -c "//img[@class='size-medium wp-image-2821 alignright'][1]"
it gives errors:
-:1.1: Start tag expected, '<' not found
<div>
^
The reason might be that when getting
xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml
it translates all tag brackets < and > into &lt; and &gt; and I don't know how to work around it.
Edit:
Here is what a news entry looks like:
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:georss="http://www.georss.org/georss"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
>
<channel>
<title>This is the title</title>
<atom:link href="https://link.to/feed" rel="self" type="application/rss+xml" />
<link>https://website.url</link>
<description>This is the description</description>
<lastBuildDate>Wed, 20 Dec 2023 04:49:30 +0000</lastBuildDate>
<language>de-DE</language>
<sy:updatePeriod>
hourly </sy:updatePeriod>
<sy:updateFrequency>
1 </sy:updateFrequency>
<generator>https://wordpress.org/?v=6.3.2</generator>
<site xmlns="com-wordpress:feed-additions:1">124249965</site>
<item>
<title>A title</title>
<link>https://link.to/the-news-entry</link>
<dc:creator><![CDATA[HP-Admin]]></dc:creator>
<pubDate>Wed, 20 Dec 2023 04:49:30 +0000</pubDate>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">https://perma.link/p123</guid>
<description><![CDATA[a short description]]></description>
<content:encoded><![CDATA[<p>A paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p><img decoding="async" fetchpriority="high" class="size-medium wp-image-2821 alignright" src="https://link.to/first-image.jpg" alt="" width="200" height="300" srcset="https://link.to/first-image.jpg 200w, https://link.to/first-image.jpg 683w, https://link.to/first-image.jpg 768w, https://link.to/first-image.jpg 1024w, https://link.to/first-image.jpg 1365w, https://link.to/first-image.jpg 367w, https://link.to/first-image.jpg 16w, https://link.to/first-image.jpg 24w, https://link.to/first-image.jpg 32w, https://link.to/first-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p><img decoding="async" class="size-medium wp-image-2820 alignright" src="https://link.to/second-image.jpg" alt="" width="200" height="300" srcset="https://link.to/second-image.jpg 200w, https://link.to/second-image.jpg 683w, https://link.to/second-image.jpg 768w, https://link.to/second-image.jpg 1024w, https://link.to/second-image.jpg 1365w, https://link.to/second-image.jpg 367w, https://link.to/second-image.jpg 16w, https://link.to/second-image.jpg 24w, https://link.to/second-image.jpg 32w, https://link.to/second-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p> </p>
<p> </p>
]]></content:encoded>
<post-id xmlns="com-wordpress:feed-additions:1">2818</post-id> </item>
</channel>
</rss>
Now I have noticed that neither of the two images is actually the header image I need. The URL of the header image does not appear in the feed XML at all, and I'm really puzzled why this happens.
To extract the simple variables, for example:
# shellcheck shell=sh disable=SC2016
xmlstarlet select -T -t \
--var idx -o "${itemnum:-1}" -b \
--var q1 -o "'" -b \
-m '/rss/channel/item[$idx]' \
-v 'concat("title=",$q1,str:replace(title,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
-v 'concat("desc=",$q1,str:replace(description,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
-v 'concat("url=",$q1,link,$q1)' -n \
-v 'concat("pubdate=",$q1,pubDate,$q1)' -n \
feed.xml
where
- xmlstarlet select's -T (aka --text) option is used for plaintext output
- --var defines a named variable, see xmlstarlet.txt for examples
- itemnum shell variable is passed in using shell parameter expansion
- title and description elements are quoted (see example in output) using the EXSLT str:replace function
Output:
title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
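The output above is laid out as single-quoted shell assignments, so one way to consume it (an assumption about the intended use, not something stated above) is to capture it and pass it to eval. A minimal sketch, with the sample output pasted in place of the real xmlstarlet call and a hypothetical variable name assignments:

```shell
# Stand-in for: assignments=$(xmlstarlet select ... feed.xml)
# The here-document is quoted ('EOF') so the text is taken literally.
assignments=$(cat <<'EOF'
title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
EOF
)

# Each line is a plain assignment; '\'' closes the quote, emits a
# literal single quote, and reopens the quote.
eval "$assignments"

printf 'title:   %s\n' "$title"
printf 'url:     %s\n' "$url"
printf 'pubdate: %s\n' "$pubdate"
```

Note that only title and desc have their single quotes escaped by the command above; eval is only as safe as that quoting, so a feed with a single quote in link or pubDate would need the same str:replace treatment.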
With GNU date, localize with e.g. date -Isec -d "${pubdate}".
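As a sketch of what that could look like in the generator script, assuming GNU date is available: the first call produces an ISO 8601 timestamp, the second a custom layout. The de_DE.UTF-8 locale is an assumption taken from the feed's <language>de-DE</language>; if it is not installed (or LC_ALL overrides it), date falls back to the default names. TZ=UTC is only there to make the result reproducible.

```shell
# Value as extracted from <pubDate> in the sample feed
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'

# ISO 8601 timestamp with seconds precision, rendered in UTC
iso=$(TZ=UTC date -Iseconds -d "$pubdate")

# Custom layout; weekday and month names follow LC_TIME when that
# locale exists on the system (assumed here: de_DE.UTF-8)
pretty=$(TZ=UTC LC_TIME=de_DE.UTF-8 date -d "$pubdate" '+%A, %d. %B %Y')

printf '%s\n%s\n' "$iso" "$pretty"
```

Dropping TZ=UTC makes date use the machine's local time zone instead.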
To extract the image URLs from the embedded HTML, for example:
# shellcheck shell=sh disable=SC2016
xmlstarlet select -T -t \
--var idx -o "${itemnum:-1}" -b \
-v '/rss/channel/item[$idx]/content:encoded' \
feed.xml |
xmlstarlet format -R -H -D |
# tee /dev/stderr |
xmlstarlet select -T -t \
--var cls -o "${class:-wp-image-2821}" -b \
--var q1 -o "'" -b \
-m 'str:split(//img[contains(@class,$cls)]/@srcset,",")' \
--var url_sz='str:split(.," ")' \
-v 'concat("url_",$url_sz[2],"=",$q1,$url_sz[1],"?width=",substring-before($url_sz[2],"w"),$q1)' -n
- xmlstarlet select's -T (aka --text) option for plaintext output
- itemnum and class are passed in using shell parameter expansion
- xmlstarlet format --recover --html --drop-dtd to convert HTML 4.0 to XML, note that HTML entities such as &nbsp; are converted (uncomment the tee line to have a look)
- xmlstarlet select and the XPath contains function to extract the appropriate img/@srcset text, str:split it, first by comma then by space, and concat the substrings to a useful format
Output:
url_200w='https://link.to/first-image.jpg?width=200'
url_683w='https://link.to/first-image.jpg?width=683'
url_768w='https://link.to/first-image.jpg?width=768'
url_1024w='https://link.to/first-image.jpg?width=1024'
url_1365w='https://link.to/first-image.jpg?width=1365'
url_367w='https://link.to/first-image.jpg?width=367'
url_16w='https://link.to/first-image.jpg?width=16'
url_24w='https://link.to/first-image.jpg?width=24'
url_32w='https://link.to/first-image.jpg?width=32'
url_1707w='https://link.to/first-image.jpg?width=1707'
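If only one URL per entry is needed, the lines above can be reduced to a single candidate in plain shell. The following is a rough sketch that picks the widest variant; the selection rule and the names candidates, widest and best_url are illustrative assumptions, and a few sample lines stand in for the real pipeline output.

```shell
# Stand-in for the captured output of the xmlstarlet pipeline
candidates=$(cat <<'EOF'
url_200w='https://link.to/first-image.jpg?width=200'
url_683w='https://link.to/first-image.jpg?width=683'
url_1707w='https://link.to/first-image.jpg?width=1707'
url_16w='https://link.to/first-image.jpg?width=16'
EOF
)

# Drop the url_ prefix so each line starts with its width, sort
# numerically on that leading number, and keep the last (widest) line.
widest=$(printf '%s\n' "$candidates" | sed 's/^url_//' | sort -n | tail -n 1)

# widest now looks like: 1707w='https://...'
# Remove everything up to the opening quote, then the closing quote.
best_url=${widest#*=\'}
best_url=${best_url%\'}

printf '%s\n' "$best_url"
```

Replacing tail -n 1 with head -n 1 would select the narrowest variant instead, which may suit a small newsletter thumbnail better.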