How can I extract HTML content on a page by ID?
I tried exploring sed/grep solutions for an hour. None worked. I then gave in and explored HTML/XML parsers. html-xml-utils can only get an element by class, not ID, making it totally useless. I consulted the manual and it seems there's no way to get by id.
xmlstarlet seemed more promising, yet it whines when I try passing it HTML files rather than XML files. The following spits out at least 100 errors:
cat /home/com/interlinked/blog.html | tail -n +2 | xmlstarlet sel -T -t -m '/div/article[@id="post33"]' -v '.' -n
I used cat here because I don't want to modify the actual file. I used tail to cut out the DOCTYPE declaration which seemed to be causing issues earlier: Extra content at the end of the document
The content on the page is well formatted and consistent. Content looks like this:
<article id="post44">
... more HTML tags and content here...
</article>
I'd like to be able to extract everything between the specific article tags here by ID (e.g. if I pass it "44" it will return the contents of post44, if I pass it 34, it will return the contents of post34).
What sets this apart from other questions is I do not want just the content, I want the actual HTML between the article tags. I don't need the article tags themselves, though removing them is probably trivial.
Is there a way to do this using the built in Unix tools or xmlstarlet or html-xml-utils? I also tried the following sed which also failed to work:
article=`patt=$(printf 'article id="post%d"' $1); sed -n '/<$patt>/,/<\/article>/{ /article>/d; p }' $file`
Here I am passing in the file path as $file, and $1 is the blog post ID (44 or 34 or whatever). The two statements are combined into one because, with the single quotes, $1 doesn't get evaluated within the sed statement otherwise. That helps the variable resolve in a related grep command but not in this sed command.
Complete HTML structure:
<!doctype html>
<html lang="en">
<head>
<title>Page</title>
</head>
<body>
<header>
<nav>
<div id="sitelogo">
<a href="/"><img src="/img/logo/logo.png" alt="InterLinked"></img></a>
</div>
<ul>
<p>Menu</p>
</ul>
</nav>
<hr>
</header>
<div id="main">
<h1>Blog</h1>
<div id="bloglisting">
<article id="post44">
<p>Content</p>
</article>
<article id="post43">
<p>Content</p>
</article>
</div>
</div>
</body>
</html>
Also, to clarify, I need this to work on 2 different pages. Some posts are inline on this main page, but longer ones have their own page. The structure is similar, but not exactly the same. I'd like a solution that just finds the ID and doesn't need to worry about parent tags, if possible. The article tags themselves are formatted the same way on both kinds of pages. For instance, on a longer blog post with its own page, the difference is here:
<div id="main">
<h1>Why Ridesharing Is Evil</h1>
<div id="blogpost">
<article id="post43">
<div>
In this case, the div bloglisting becomes blogpost. That's really the only big difference.
You can use the libxml2 tools to parse HTML/XML with proper syntax awareness. For your case, you can use xmllint, ask it to parse the HTML file with the --html flag, and provide an XPath query from the top level to get the node of your choice.
For example, to get the content of the post with id post43, use a filter like
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html
If the xmllint compiled on your machine does not understand a few recent (HTML5) tags like <article> or <nav>, suppress the warnings by adding 2>/dev/null at the end of the command.
If you want to get only the contents within <article> and not the tags themselves, remove the first and last lines by piping the result to sed as below.
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html 2>/dev/null |
sed '1d; $d'
To use a variable for the post ID, define a shell variable and use it within the XPath query:
postID="post43"
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='"$postID"']" html 2>/dev/null |
sed '1d; $d'
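Since the question asks for something that works on both page layouts without caring about parent tags, note that // already matches at any depth, so the path can be shortened to //article[@id='...']. Below is a minimal sketch of that idea wrapped in a function, assuming xmllint is installed. The function name extract_post and its digits-only check are illustrative and not part of the commands above. It also selects /node() (the children of the article) instead of trimming lines with sed, a swapped-in variation that should not depend on the article tags sitting on their own lines.

```shell
#!/bin/sh
# extract_post FILE ID
# Hypothetical helper (the name and the digit check are not from the original
# commands). Prints the HTML inside <article id="postID">, without the
# article tags themselves. Assumes xmllint (libxml2) is installed.
extract_post() {
    file=$1
    id=$2
    # Accept digits only so nothing unexpected is spliced into the XPath query.
    case $id in
        ''|*[!0-9]*)
            echo "usage: extract_post FILE NUMERIC_ID" >&2
            return 2
            ;;
    esac
    # //article matches at any depth, so it does not matter whether the parent
    # div is "bloglisting" or "blogpost". /node() selects the children of the
    # article, which leaves the <article> tags out of the output.
    xmllint --html --xpath "//article[@id='post${id}']/node()" "$file" 2>/dev/null
}
```

With the function defined in the current shell, extract_post /home/com/interlinked/blog.html 44 would print the contents of post44, and passing 34 would print post34. Exact whitespace in the output may vary between xmllint versions.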