javascript regex dom traversal

How to wrap all occurrences of known text inside an HTML document?


I want to find every piece of text or href value within an HTML document that matches a RegExp and wrap each one in a tag (e.g. turning plain text into links).

Consider the following HTML:

<body>
  <!-- test1 <div>test2 <a href="test3">test4</a></div> -->
  test5
  <a href="test6">notest</a>
  <div>
    test8
    <p>
      test9 notest test10
      <a href="notest">test12</a>
      <input type="text" name="test13">test14</input>
    </p>
    test15
  </div>
</body>

Then this would be my required replacement:

<body>
  <!-- test1 <div>test2 <a href="test3">test4</a></div> -->
  <div class="wrapped">test5</div>
  <div class="wrapped"><a href="test6">notest</a></div>
  <div>
    <div class="wrapped">test8</div>
    <p>
      <div class="wrapped">test9</div> notest
      <div class="wrapped">test10</div>
      <div class="wrapped"><a href="notest">test12</a></div>
      <input type="text" name="test13">test14</input>
    </p>
    <div class="wrapped">test15</div>
  </div>
</body>

Notice that tests 5, 6, 8, 9, 10, 12, 15 got wrapped.

Nothing may be inserted into input boxes or into any other special HTML tags that are not displayed (e.g. <script>, <doctype> and so on).

So far I have been using a stack-based approach:

  1. Push body onto stack.

  2. e = stack.pop().

  3. Push all children of e of type element onto stack, except links (<a> nodes) and elements of class="wrapped".

  4. Check all remaining e.children of type link for a matching href or text and wrap.

  5. Wrap all innermost matches within all e.children of type text.

  6. If stack is not empty, then go to 2.

  7. Done.
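
The steps above could be sketched roughly as follows. This is only an illustration of the collection phase: it walks the tree with an explicit stack and gathers the links and text nodes that match, but leaves the actual wrapping out. The names `collectMatches`, `SKIP_TAGS` and `hasClass` are made up for this sketch, and the list of skipped tags is an assumption based on the elements mentioned above.

```javascript
// Hypothetical helper names; the skip list is an assumption, extend as needed.
var ELEMENT_NODE = 1, TEXT_NODE = 3;
var SKIP_TAGS = { SCRIPT: true, STYLE: true, INPUT: true, TEXTAREA: true };

function hasClass(el, name) {
  return (' ' + (el.className || '') + ' ').indexOf(' ' + name + ' ') !== -1;
}

// Returns the <a> elements and text nodes below `root` that match `pattern`.
function collectMatches(root, pattern) {
  // Non-global copy, so repeated test() calls do not depend on lastIndex.
  var re = new RegExp(pattern.source, pattern.ignoreCase ? 'i' : '');
  var stack = [root];                                   // step 1
  var links = [];
  var textNodes = [];
  while (stack.length > 0) {                            // step 6
    var e = stack.pop();                                // step 2
    for (var i = 0; i < e.childNodes.length; i++) {
      var child = e.childNodes[i];
      if (child.nodeType === TEXT_NODE) {
        // step 5 (collection only): text node containing at least one match
        if (re.test(child.nodeValue)) textNodes.push(child);
      } else if (child.nodeType === ELEMENT_NODE) {
        var tag = child.nodeName.toUpperCase();
        if (tag === 'A') {
          // step 4: links are checked by href or text, never descended into
          var href = child.getAttribute('href') || '';
          if (re.test(href) || re.test(child.textContent)) links.push(child);
        } else if (!SKIP_TAGS[tag] && !hasClass(child, 'wrapped')) {
          stack.push(child);                            // step 3
        }
      }
      // comments and other node types are ignored
    }
  }
  return { links: links, textNodes: textNodes };
}
```

With a pattern such as `/test\d+/` (assumed here from the sample markup), `collectMatches(document.body, /test\d+/)` would return the links and text nodes that still need wrapping. Results come back in stack order, not document order.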

The JavaScript is only required to run on Firefox 8.

I would like to do the wrapping without a tree traversal; a linear pass would be ideal.


Solution

  • Why do you not want any tree traversal? I think your current algorithm is as good as it gets.

    The problem is that the DOM does not offer any sophisticated method to get all text nodes.

    I haven't run any performance tests, but the following approach may be about as fast:

    1. nodes := getElementsByTagName('*')
    2. excludes := document.querySelectorAll('a, a *, .wrapped, .wrapped *, script, style, input, textarea [, ...]')
      (querySelectorAll should perform pretty well)
    3. targets := nodes - excludes
      (not sure about the performance here)
    4. Iterate over targets
      • Iterate over children
      • Wrap each textNode
    5. Handle <a> elements separately
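
A rough sketch of steps 1 to 4 in JavaScript, offered as an illustration, not as a tested implementation. It assumes that both `getElementsByTagName('*')` and `querySelectorAll` return elements in document order, which lets the subtraction in step 3 run in a single pass. The names `subtractInOrder`, `splitByPattern`, `wrapTextChildren` and `EXCLUDE_SELECTOR` are invented for this sketch; the `div.wrapped` wrapper follows the markup in the question.

```javascript
// Illustrative selector; extend it with any other tags that must stay untouched.
var EXCLUDE_SELECTOR = 'a, a *, .wrapped, .wrapped *, script, style, input, textarea';

// Step 3: nodes - excludes. Assumes both lists are in document order and that
// every entry of `excludes` also occurs in `nodes`, so one pass is enough.
function subtractInOrder(nodes, excludes) {
  var result = [];
  var j = 0;
  for (var i = 0; i < nodes.length; i++) {
    if (j < excludes.length && nodes[i] === excludes[j]) {
      j++;                       // excluded element: skip it
    } else {
      result.push(nodes[i]);
    }
  }
  return result;
}

// Split a string into matching and non-matching pieces, in order.
function splitByPattern(text, pattern) {
  var re = new RegExp(pattern.source, 'g' + (pattern.ignoreCase ? 'i' : ''));
  var parts = [];
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; }   // avoid looping on empty matches
    if (m.index > last) parts.push({ text: text.slice(last, m.index), match: false });
    parts.push({ text: m[0], match: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push({ text: text.slice(last), match: false });
  return parts;
}

// Steps 1, 2 and 4: wrap each match found in the direct text children of the targets.
function wrapTextChildren(root, pattern) {
  var doc = root.ownerDocument;
  var targets = subtractInOrder(root.getElementsByTagName('*'),
                                root.querySelectorAll(EXCLUDE_SELECTOR));
  targets.unshift(root);         // getElementsByTagName does not include root itself
  targets.forEach(function (el) {
    // Copy the child list, because it changes while nodes are replaced.
    Array.prototype.slice.call(el.childNodes).forEach(function (child) {
      if (child.nodeType !== 3) return;                    // text nodes only
      var parts = splitByPattern(child.nodeValue, pattern);
      if (!parts.some(function (p) { return p.match; })) return;
      var frag = doc.createDocumentFragment();
      parts.forEach(function (p) {
        var piece = doc.createTextNode(p.text);
        if (p.match) {
          var wrapper = doc.createElement('div');
          wrapper.className = 'wrapped';
          wrapper.appendChild(piece);
          piece = wrapper;
        }
        frag.appendChild(piece);
      });
      el.replaceChild(frag, child);
    });
  });
}
```

A call might look like `wrapTextChildren(document.body, /test\d+/)`, where the pattern is guessed from the sample markup. Because `targets` is a static array built before any node is replaced, the newly created wrappers are never visited again. Step 5 (the `<a>` elements) is not covered here.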