I'm wondering how I can extract the target of a hyperlink (its href value) from HTML.
For instance:
<article id="post36">
<div>
<h3><a href="/blog/2019/4-14-canaries-in-the-coal-mine.html">Canaries in the Coal Mine</a></h3>
<p class="author">Posted by <a href="/blog/authors/moderator.html" rel="author">Moderator</a></p>
<p><time><span>Sunday, April 14th, 2019</span> — 8:17AM</time></p>
</div>
Other posts look like this (no external page):
<article id="post33">
<div>
<h3><a href="#post33">Landlines Win Again</a></h3>
<p class="author">Posted by <a href="/blog/authors/moderator.html" rel="author">Moderator</a></p>
<p><time><span>Friday, December 21st, 2018</span> — 7:14AM</time></p>
In an external script, I am passed the ID of a particular post, in this case post 36 (the first example above). I have a page containing the metadata for every post in article tags like those above.
I tried catting the webpage (I have a local copy) and piping it to sed -n 's|[^<]*<article\([^<]*\)</article>[^<]*|\1\n|gp'
That sort of works, but it only returns the opening article tags, like this:
<article id="post6">
<article id="post5">
<article id="post4">
<article id="post3">
<article id="post2">
<article id="post1">
My conclusion is that sed only works on the current line. And when I actually try using the ID, I get nothing: sed -n 's|[^<]*<article id="post36">\([^<]*\)</article>[^<]*|\1\n|gp'
My question here is how can I take advantage of the built-in Unix tools (sed, grep, awk, etc.) to extract the hyperlink? In this case, what I need is /blog/2019/4-14-canaries-in-the-coal-mine.html
Yes, I have consulted a number of SO posts like this one and this one, most of which discourage this kind of thing (I tried the native solutions but none worked).
Two things:
You can single out the interesting line with a sed address, in this case a regexp that matches the h3 line containing the <a href:
sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p' test.html
/blog/2019/4-14-canaries-in-the-coal-mine.html
#post33
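The one-liner above can be read piece by piece. The following is only a sketch of the same command wrapped in a function so it can be annotated; the name h3_links is made up for illustration, and -E is used as the portable spelling of GNU sed's -r.

```shell
# Hypothetical helper: print the href of every <h3> link in the given file.
h3_links() {
    sed -n -E '
        # Address: only act on lines that hold an <h3> link
        # pointing at "#postN" or at something under "/blog/".
        /h3.*href.*(#post[0-9]+|\/blog\/)/ {
            # Replace the whole line with the quoted href value and print it.
            # -n suppresses every line that is not printed explicitly.
            s/.*<a href="([^"]+)".*/\1/p
        }
    ' "$1"
}
```

Calling h3_links test.html should give the same two lines as the command above. The author line is skipped because it contains no h3, so its href never reaches the substitution.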
To match by article id, put this grep in front of the sed command (-A3 also prints the three lines after the match, which is enough to reach the h3 line):
grep -A3 'article id="post36"' test.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
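If the h3 line is not always within three lines of the article tag, the same idea can be written in awk so that it does not depend on a fixed line count. This is a rough sketch, not a general HTML parser: post_href is a hypothetical name, and it assumes each opening article tag and each h3 link sit on a single line, as in the samples in the question.

```shell
# Hypothetical helper: print the href of the <h3> link inside one article.
# Usage: post_href ID FILE
post_href() {
    awk -v id="$1" '
        # Start looking once the opening tag of the requested article is seen.
        # The closing quote keeps "post3" from matching "post36".
        index($0, "<article id=\"post" id "\"") { in_post = 1; next }
        # Another article started before a link was found: give up.
        in_post && /<article / { exit }
        # First <h3> link inside the article: strip everything around the href.
        in_post && /<h3><a href="/ {
            sub(/.*<h3><a href="/, "")
            sub(/".*/, "")
            print
            exit
        }
    ' "$2"
}
```

With the sample from the question, post_href 36 test.html prints /blog/2019/4-14-canaries-in-the-coal-mine.html, and an ID that is not in the file prints nothing.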