Given this example HTML document:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>title</title>
</head>
<body>
<iframe></iframe>
<div class="text">TEST</div>
<div id="trend" data-app="openableBox" class="box sub-box">
<div class="box-header">
<h1><span>Highlights</span></h1>
</div>
</div>
</body>
</html>
How can I extract
<iframe></iframe>
<div class="text">TEST</div>
by dropping everything before <iframe> and everything from <div id="trend"> onward?
Thanks in advance for any help.
Here is a solution to the general problem of selecting a range of elements based on a "linearization" of the HTML. It uses pup to convert the HTML to JSON, and then uses jq to perform the linearization, the selection, and the conversion back to HTML.
The idea is to "linearize" the HTML by recursively hoisting the children to the top level:
# Emit a stream by hoisting .children recursively.
# It is assumed that the input is an array,
# and that .children is always an array.
def hoist:
  .[]
  | if type == "object" and has("children")
    then del(.children), (.children | hoist)
    else .
    end;

# Emit the 0-based index of the first item in the input array
# that satisfies the condition, or null if there is none.
def indexof(condition):
  label $out
  | foreach .[] as $x (null; .+1;
      if ($x|condition) then .-1, break $out else empty end)
    // null;

# Reconstitute the HTML element
def toHtml:
  def k: . as $in | (keys_unsorted - ["tag", "text"])
    | reduce .[] as $k (""; . + " \($k)=\"\($in[$k])\"");
  def t: if .text then .text else "" end;
  "<\(.tag)\(k)>\(t)</\(.tag)>"
;

# Linearize and then select the desired range of elements
[hoist]
| indexof( .tag == "iframe") as $first
| indexof( .tag == "div" and .id=="trend") as $last
| .[$first:$last]
| .[]
| toHtml
With the program saved as program.jq, invoke it as follows:
pup 'json{}' < input.html | jq -rf program.jq
Output:
<iframe></iframe>
<div class="text">TEST</div>
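For readers who want to trace the hoist, select, and rebuild steps without running pup or jq, here is a minimal Python sketch of the same idea. The `doc` literal is an assumption about the JSON shape that `pup 'json{}'` produces for the example document (one object per element, with "tag", optional "text", optional "children", and one key per attribute); the function names `index_of` and `to_html` are illustrative and not part of any library.

```python
# Assumed shape of the pup JSON for the example document (hand-written here).
doc = [
    {"tag": "html", "lang": "en", "children": [
        {"tag": "head", "children": [
            {"tag": "meta", "charset": "utf-8"},
            {"tag": "title", "text": "title"},
        ]},
        {"tag": "body", "children": [
            {"tag": "iframe"},
            {"tag": "div", "class": "text", "text": "TEST"},
            {"tag": "div", "id": "trend", "data-app": "openableBox",
             "class": "box sub-box", "children": [
                {"tag": "div", "class": "box-header", "children": [
                    {"tag": "h1", "children": [
                        {"tag": "span", "text": "Highlights"},
                    ]},
                ]},
            ]},
        ]},
    ]},
]

def hoist(nodes):
    """Yield every node in document order, each without its "children" key."""
    for node in nodes:
        yield {k: v for k, v in node.items() if k != "children"}
        yield from hoist(node.get("children", []))

def index_of(items, condition):
    """Return the index of the first item satisfying condition, else None."""
    return next((i for i, x in enumerate(items) if condition(x)), None)

def to_html(node):
    """Rebuild a single childless element from its flattened description."""
    attrs = "".join(f' {k}="{v}"' for k, v in node.items()
                    if k not in ("tag", "text"))
    return f'<{node["tag"]}{attrs}>{node.get("text", "")}</{node["tag"]}>'

flat = list(hoist(doc))  # the "linearized" document
first = index_of(flat, lambda n: n["tag"] == "iframe")
last = index_of(flat, lambda n: n["tag"] == "div" and n.get("id") == "trend")
selected = [to_html(n) for n in flat[first:last]]  # half-open range, as in jq
print("\n".join(selected))
```

Note that linearization discards nesting: if the selected range contained an element with children, the parent and its children would be emitted as siblings, each closed immediately. That is harmless for this example, where both selected elements are childless.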