Tags: regex, coldfusion, html-parsing, coldfusion-8

How to get the string of everything between two em tags?


I want to get the string between the em tags, including any other HTML inside them.

For example:

<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>

The output should be:

UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown

please help me.

Thanks


Solution

  • Use the regular expression function like this:

    REMatch("(?s)<em>.*?</em>", html)
    

    See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html

    • The (?s) sets single-line mode, so that the input text is treated as one line even if it contains line feeds (that is, . also matches newline characters). As Peter pointed out in a comment, this is not the default and therefore must be set.

    • The .*? matches all characters in between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input HTML contains something like <em>foo</em><em>bar</em>, where a greedy .* would otherwise match from the first <em> to the last </em> as a single match.

    • The returned array contains all matches found, i.e. every piece of text, including HTML, that was inside <em> tags. Each match also includes the enclosing <em> and </em> tags themselves.
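
Since the question asks for the content without the surrounding tags, the matches need one more step. The following is a minimal sketch of that step, not a definitive implementation: the variable names (html, matches, inner, twoTags) are made up for illustration, and the extra REReplace call is an assumption about how you might strip the outer tags. It also shows the greedy versus non-greedy difference on the <em>foo</em><em>bar</em> example.

```cfml
<cfscript>
// Sample input from the question; Chr(10) stands in for the line break.
html = "<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />" & Chr(10) & "By: K.J.S. McKeown</em>";

// Each array element is a full match, still wrapped in <em>...</em>.
matches = REMatch("(?s)<em>.*?</em>", html);

// Hypothetical result array: the same matches with the outer tags removed.
inner = ArrayNew(1);
for (i = 1; i lte ArrayLen(matches); i = i + 1) {
    // Remove <em> at the very start and </em> at the very end of the match.
    ArrayAppend(inner, REReplace(matches[i], "^<em>|</em>$", "", "all"));
}

// HTMLEditFormat escapes the markup so the extracted HTML is shown literally.
for (i = 1; i lte ArrayLen(inner); i = i + 1) {
    WriteOutput("<pre>" & HTMLEditFormat(inner[i]) & "</pre>");
}

// Greedy versus non-greedy on the two-element example from the answer.
twoTags = "<em>foo</em><em>bar</em>";
WriteOutput(ArrayLen(REMatch("(?s)<em>.*?</em>", twoTags))); // 2 matches
WriteOutput(ArrayLen(REMatch("(?s)<em>.*</em>", twoTags)));  // 1 match
</cfscript>
```

The index-based loops are used instead of newer loop syntax so that the sketch stays within what ColdFusion 8 cfscript supports.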

    Note that this can fail where </em> also occurs as attribute text and is incorrectly not HTML-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. JavaScript script tags). A regex cannot replace a full HTML/XML parser; if you need 100% accuracy, you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
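
To make the failure case concrete, here is a small sketch that runs the same pattern against the attribute example from the note. The variable names (tricky, matches) are illustrative only.

```cfml
<cfscript>
// Input from the note: "</em>" appears unescaped inside an attribute value.
tricky = '<em><a title="Help for </em> tag">click</a></em>';

matches = REMatch("(?s)<em>.*?</em>", tricky);

// The single match stops at the first "</em>", inside the title attribute:
// <em><a title="Help for </em>
WriteOutput(HTMLEditFormat(matches[1]));
</cfscript>
```

The non-greedy quantifier ends the match at the first </em> it sees, so the result is cut off in the middle of the <a> tag. This is the kind of input where a parser is the safer choice.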