Tags: regex, ruby, html-parsing, ruby-1.9.3

Regex to match anything except HTML tags when code is encoded using &lt; and &gt;


I am trying to use regex to match any text except for HTML tags. I have found this solution for "normal" HTML code:

<[^>]*>(*SKIP)(*F)|[^<]+

However, my code is encoded using &lt; and &gt; instead of < and >, and I have not been able to modify the regex above for it to work.

As an example, given the text:

Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;

I need to match "Hi" and "there, how are you". Note that I also need to match text that is not between tags, such as "Hi" in this example.

UPDATE: since I am using Ruby's gsub, it looks like I cannot use (*SKIP) and (*F) at all.

UPDATE 2: I was trying not to go into too much detail, but this seems to be important: I actually need to replace all the spaces in a text, except those that are part of a tag, be it a &lt; ... &gt; tag or a <...> tag.


Solution

  • You can use

    text = text.gsub(/(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m) { $1 || '_' }
    

    I suggest [[:blank:]] instead of \s, on the assumption that you do not want to replace line breaks: \s also matches \r and \n, while [[:blank:]] matches only horizontal whitespace such as spaces and tabs.

    The regex above matches

    • (&lt;.*?&gt;|<[^>]*>) - Group 1: either &lt;, then any zero or more chars, as few as possible (the /m flag lets . match line breaks too), then &gt;; or <, then zero or more chars other than >, and then a >
    • | - or
    • [[:blank:]] - any single horizontal whitespace (you may also use [\p{Zs}\t] to match any Unicode horizontal whitespace).
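    To illustrate the difference between these character classes, here is a minimal sketch on a made-up string (the variable names are only for illustration). It shows \s consuming the line break while the two horizontal-whitespace classes leave it alone:

```ruby
# Made-up sample: a space, a tab, and a Windows line break.
line = "a b\tc\r\nd"

# \s matches space, tab, \r, \n, \f and \v, so the line break is replaced too.
with_s = line.gsub(/\s/, '_')

# [[:blank:]] matches horizontal whitespace only; \r\n is kept.
with_blank = line.gsub(/[[:blank:]]/, '_')

# The explicit alternative mentioned above: Unicode space separators plus tab.
with_zs = line.gsub(/[\p{Zs}\t]/, '_')

p with_s      # => "a_b_c__d"
p with_blank  # => "a_b_c\r\nd"
p with_zs     # => "a_b_c\r\nd"
```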

    The { $1 || '_' } block supplies the replacement: when Group 1 matched (that is, a tag was found), $1 is returned and the tag is put back unchanged; otherwise $1 is nil, and _ is returned to replace the horizontal whitespace character.
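    As a runnable sketch of the whole approach, the snippet below applies the pattern to the sample text from the question, extended with a literal <b> tag to cover both tag styles. The constant name, the method name, and the extra <b> tag are assumptions made for this example, not part of the original code:

```ruby
# Pattern from the answer: Group 1 captures a tag (entity-encoded or literal);
# the second alternative matches a single horizontal whitespace character.
TAG_OR_BLANK = /(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m

# Hypothetical helper: replaces blanks outside tags, leaves tags untouched.
def replace_blanks_outside_tags(text, replacement = '_')
  # $1 is set only when the tag alternative matched; otherwise it is nil.
  text.gsub(TAG_OR_BLANK) { $1 || replacement }
end

# Sample from the question, plus a literal <b> tag added for illustration.
sample = "Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt; <b class=\"x\">bold text</b>"

result = replace_blanks_outside_tags(sample)
p result
```

    Spaces inside both kinds of tags (for example in p class="hello") are preserved, the \r\n line breaks are left alone, and every other space becomes an underscore.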