html regex pattern-matching nsregularexpression

How to skip content inside a <span class=""> </span> tag during a regex search?


Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have an HTML string like this:

<html>
  <div>
      <p>this is sample content</p>
  </div>
  <div>
      <p>this is another sample</p>
      <span class="test">this sample should not caught</span>
      <div>
       this is another sample
      </div>
  </div>
</html>

Now I want to search this string for the word "sample", but I should not get the "sample" that is inside the <span>...</span>.

I want to do this using a regex. I have tried a lot but can't get it to work. Any help is greatly appreciated.

Thanks in advance.


Solution

  • This is quite brittle and fails if span tags can be nested. If you don't have nested spans, try

    (?s)sample(?!(?:(?!</?span).)*</span>)
    

    This matches sample only if the next span tag after it (if any) is not a closing tag.

    Explanation:

    (?s)          # Switch on dot-matches-all mode
    sample        # Match "sample".
    (?!           # only if it's not followed by this subpattern:
     (?:          #  Match...
      (?!</?span) #   (unless we're at the start of a span tag)
      .           #   any character
     )*           #  any number of times.
     </span>      #  Match a closing span tag.
    )             # End of lookahead
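
As a quick check, the pattern above can be run against the sample document from the question. This is a minimal sketch in Python's `re` module, chosen only for illustration (the question's tags suggest NSRegularExpression, which should accept the same lookahead syntax). The helper name `matched_line_numbers` is hypothetical.

```python
import re

# The sample document from the question, copied as-is.
HTML = """<html>
  <div>
      <p>this is sample content</p>
  </div>
  <div>
      <p>this is another sample</p>
      <span class="test">this sample should not caught</span>
      <div>
       this is another sample
      </div>
  </div>
</html>"""

# The pattern from the answer, unchanged.
OUTSIDE_SPAN = re.compile(r"(?s)sample(?!(?:(?!</?span).)*</span>)")


def matched_line_numbers(pattern, text):
    """Return the 0-based line index of every match (hypothetical helper)."""
    return [text.count("\n", 0, m.start()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Line 6 holds the <span>, so it should be absent from the result.
    print(matched_line_numbers(OUTSIDE_SPAN, HTML))
```

Three of the four occurrences match (0-based lines 2, 5 and 8); the one on line 6, inside the span, is skipped.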
    

    To match sample only if it's neither within a span nor a p, you can use

    (?s)sample(?!(?:(?!</?span).)*</span>)(?!(?:(?!</?p).)*</p>)
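
The combined pattern can be tried against the question's sample document in the same way. Again a sketch in Python's `re`, used here only as an assumption for illustration; `lines_with_match` is a hypothetical helper.

```python
import re

# The sample document from the question, copied as-is.
HTML = """<html>
  <div>
      <p>this is sample content</p>
  </div>
  <div>
      <p>this is another sample</p>
      <span class="test">this sample should not caught</span>
      <div>
       this is another sample
      </div>
  </div>
</html>"""

# The span-and-p pattern from the answer, unchanged.
OUTSIDE_SPAN_AND_P = re.compile(
    r"(?s)sample(?!(?:(?!</?span).)*</span>)(?!(?:(?!</?p).)*</p>)"
)


def lines_with_match(pattern, text):
    """Return the stripped source line of every match (hypothetical helper)."""
    lines = text.splitlines()
    return [lines[text.count("\n", 0, m.start())].strip()
            for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(lines_with_match(OUTSIDE_SPAN_AND_P, HTML))
```

Only the occurrence in the bare inner div remains, since the other three are inside either a p or the span.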
    

    But all this depends entirely on tags being unnested (i.e., no two tags of the same kind may be nested) and correctly balanced (which often isn't the case with p tags, whose closing tag is frequently omitted).
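
Both failure modes are easy to reproduce. The sketch below uses Python's `re` purely for illustration, and the two input strings are made-up examples, not taken from the question. In each case "sample" is inside the tag that should exclude it, yet the pattern still matches.

```python
import re

SPAN_RULE = re.compile(r"(?s)sample(?!(?:(?!</?span).)*</span>)")
SPAN_AND_P_RULE = re.compile(
    r"(?s)sample(?!(?:(?!</?span).)*</span>)(?!(?:(?!</?p).)*</p>)"
)

# Hypothetical input 1: nested spans. The next span tag after "sample"
# is an opening one, so the lookahead wrongly concludes we are outside.
NESTED_SPANS = "<span>sample one <span>inner</span> tail</span>"

# Hypothetical input 2: the first <p> is never closed, which HTML allows.
UNCLOSED_P = "<p>sample text\n<p>second paragraph</p>"

false_positive_nested = SPAN_RULE.findall(NESTED_SPANS)
false_positive_unclosed = SPAN_AND_P_RULE.findall(UNCLOSED_P)

if __name__ == "__main__":
    print(false_positive_nested)    # a non-empty list means a false positive
    print(false_positive_unclosed)
```

If inputs like these can occur, an HTML parser that tracks the open-element stack is likely a safer tool than a lookahead.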