How to cut HTML file (drop anything outside two tags)?


Given this example HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <iframe></iframe>
    <div class="text">TEST</div>
    <div id="trend" data-app="openableBox" class="box sub-box">
        <div class="box-header">
            <h1><span>Highlights</span></h1>
        </div>
    </div>
  </body>
</html>

How can I extract

<iframe></iframe>
<div class="text">TEST</div>

by dropping everything before <iframe> and everything from <div id="trend"> onward?

Thanks in advance for any help.


Solution

  • Here is a solution to the general problem of selecting a range of elements based on a "linearization" of the HTML. It uses pup to convert the HTML to JSON, and then jq to perform the linearization, selection, and conversion back to HTML.

    program.jq

    The idea is to "linearize" the HTML by recursively hoisting the children to the top-level:

    # Emit a stream by hoisting .children recursively.
    # It is assumed that the input is an array, 
    # and that .children is always an array.
    def hoist:
      .[]
      | if type == "object" and has("children")
        then del(.children), (.children | hoist)
        else .
        end;
    
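    # Emit the 0-based index of the first item of the input array
    # that satisfies `condition`, or null if no item does.
    def indexof(condition):
      label $out
      | foreach .[] as $x (null; .+1;
          if ($x|condition) then .-1, break $out else empty end)
        // null;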
    
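    # Reconstitute the HTML element from its flattened JSON form.
    def toHtml:
      # k: render every key other than "tag" and "text" as an attribute
      def k: . as $in | (keys_unsorted - ["tag", "text"])
      | reduce .[] as $k (""; . + " \($k)=\"\($in[$k])\"");
      # t: the element's text, or "" if it has none
      def t: if .text then .text else "" end;
      "<\(.tag)\(k)>\(t)</\(.tag)>"
      ;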
    
    # Linearize and then select the desired range of elements
    [hoist]
    | indexof( .tag == "iframe") as $first
    | indexof( .tag == "div" and .id=="trend") as $last
    | .[$first:$last]
    | .[]
    | toHtml
    

    Invocation:

    pup 'json{}' < input.html | jq -rf program.jq
    

    Output:

    <iframe></iframe>
    <div class="text">TEST</div>
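
For readers less familiar with jq, the same linearize-then-slice idea can be sketched in Python. This is a port for illustration, not part of the original answer. The SAMPLE data is a hand-written approximation of what pup's JSON might look like for the example document, inferred from the keys the jq program uses ("tag", "text", "children", plus attributes); the names hoist, index_of, to_html and cut are chosen here for the sketch.

```python
def hoist(nodes):
    """Yield every element in document order, each without its "children" key."""
    for node in nodes:
        if isinstance(node, dict) and "children" in node:
            yield {k: v for k, v in node.items() if k != "children"}
            yield from hoist(node["children"])
        else:
            yield node


def index_of(items, condition):
    """Return the index of the first item satisfying condition, or None."""
    for i, item in enumerate(items):
        if condition(item):
            return i
    return None


def to_html(el):
    """Rebuild a flat element: every key except "tag" and "text" is an attribute."""
    attrs = "".join(f' {k}="{v}"' for k, v in el.items() if k not in ("tag", "text"))
    return f'<{el["tag"]}{attrs}>{el.get("text") or ""}</{el["tag"]}>'


def cut(nodes):
    """Linearize, then keep elements from the first <iframe> up to (not including) div#trend."""
    flat = list(hoist(nodes))
    first = index_of(flat, lambda e: e.get("tag") == "iframe")
    last = index_of(flat, lambda e: e.get("tag") == "div" and e.get("id") == "trend")
    return [to_html(e) for e in flat[first:last]]


# Assumed shape of the JSON for the example document (hand-written, not real pup output).
SAMPLE = [
    {"tag": "html", "lang": "en", "children": [
        {"tag": "head", "children": [
            {"tag": "meta", "charset": "utf-8"},
            {"tag": "title", "text": "title"},
        ]},
        {"tag": "body", "children": [
            {"tag": "iframe"},
            {"tag": "div", "class": "text", "text": "TEST"},
            {"tag": "div", "id": "trend", "data-app": "openableBox",
             "class": "box sub-box", "children": [
                {"tag": "div", "class": "box-header", "children": [
                    {"tag": "h1", "children": [{"tag": "span", "text": "Highlights"}]},
                ]},
            ]},
        ]},
    ]},
]

if __name__ == "__main__":
    print("\n".join(cut(SAMPLE)))
```

As in the jq version, a missing start or end marker leaves that side of the slice open, and only flat elements are emitted, so any nesting inside the selected range would not be reconstructed. To run this against real input, SAMPLE could be replaced with JSON read from the output of pup 'json{}', provided that output has the shape assumed above.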