Search code examples
javascriptarraysrecursionfilterdomparser

Regular expression for finding all sub strings which are NOT inside a HTML link element


Assuming I have the following HTML snippet:

Deep Work 
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link" 
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>

Which regular expression will only match the two "Deep Work" text snippets which are located completely outside of the a-blocks? So, only the ones marked as yellow in this screenshot (not the red ones):

Screenshot

I tried multiple approaches, but always ended up getting a match for the last red one. Which I need to avoid. Thus I would appreciate any help from the community. Thanks!

Update: Unfortunately I simplified the HTML code above too much, using line breaks, to get it readable in StackOverflow. Here is better use case:

<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>

Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.


Solution

  • »Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.«

    Since the OP's example clearly shows, that the OP wants to match any node value (or text content) from just the first-level text-nodes, a solution based on DOMParser.parseFromString might look similar to the next provided example code ...

    const sampleMarkup =
      `Deep Work 1
      <p>
        <a data-href="Deep Work" href="Deep Work" class="internal-link" 
        target="_blank" rel="noopener">Deep Work</a>
      </p>
      Deep Work 2
      <a href="blabla">Some other text</a>`;
    
    console.log(
      Array
        // make an array from ...
        .from(
          (new DOMParser)
            // ... a parsed document ...
            .parseFromString(sampleMarkup, 'text/html')
            // ... body's ...
            .querySelector('body')
            // ... child nodes ...
            .childNodes
        )
        // ... and filter just the first level text nodes in order to ...
        .filter(node => node.nodeType === 3)
        // ... retrieve each matching text node's sanitized/trimmed text content.
        .map(node => node.nodeValue.trim())
    );
    .as-console-wrapper { min-height: 100%!important; top: 0; }

    From the above comments ...

    »The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an <a/> element (the one-liner markup).«

    ... but as already mentioned ...

    »The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a DOMParser based approach.«

    ... due to a reliable approach, the refactoring can be achieved pretty easily ...

    // `<p>
    //   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
    //     Deep Work
    //   </a>
    //   Deep Work 1
    //   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
    //     Deep Work
    //   </a>
    //   Deep Work 2
    // </p>`
    
    const sampleMarkup = '<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 1<a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 2</p>';
    
    function collectTextNodes(textNodeList, node) {
      const nodeType = node?.nodeType;
      if (nodeType === 1) {
    
        [...node.childNodes]
          .reduce(collectTextNodes, textNodeList);
    
      } else if (nodeType === 3) {
    
        textNodeList.push(node);
      }
      return textNodeList;
    }
    
    console.log(
      // ... collect any text node from within ...
      collectTextNodes(
        [],
        (new DOMParser)
          // ... a parsed document's ...
          .parseFromString(sampleMarkup, 'text/html')
          // ... body ...
          .querySelector('body')
      )
      // ... and filter any text node which is not located within an `<a/>` element ...
      .filter(textNode =>
        textNode.parentNode.closest('a') === null
      )
      // ... and retrieve each matching text node's sanitized/trimmed text content.
      .map(node =>
        node.nodeValue.trim()
      )
    );
    .as-console-wrapper { min-height: 100%!important; top: 0; }