RegEx Replace with character substitution in captured group


I can match the string I'm interested in with a regex, but how do I replace it with a copy of the match in which one character has been substituted?

I want to remove the > character from inside any HTML attribute value, or replace it with &gt;.

Sample original string

<html> 
<head></head> 
<body> 
<div  sometag="abc>def" onclick="myfn()" class='xyz'>
Dear {@CustomerName},
blah blah blah
</div></body> 
</html>

Desired result

<html> 
<head></head> 
<body> 
<div  sometag="abc&gt;def" onclick="myfn()" class='xyz'>
Dear {@CustomerName},
blah blah blah
</div></body> 
</html>

I'm using the following regex pattern and replacement

Pattern: \s\w+\s*=\s*(['"])[^\1]+?\1

Replacement: -- don't know! what should I use? --

This is my VB.NET code (in case it helps)

Dim reAttr As New Regex("\s\w+\s*=\s*(['""])[^\1]+?\1", RegexOptions.Singleline)
result = reAttr.Replace(text, Replace("$&", ">", ""))

Solution

  • You can use

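    ' Requires: Imports System.Text.RegularExpressions
    Dim reAttr As New Regex("\s\w+\s*=\s*(['""])(?:(?!\1).)*?\1", RegexOptions.Singleline)
    ' The evaluator runs once per match; m.Value is the whole matched attribute.
    Dim result = reAttr.Replace(text, New MatchEvaluator(
        Function(m As Match)
            Return m.Value.Replace(">", "&gt;")
        End Function))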
    

    Note that [^\1] does not do what you expect. Inside a character class, \1 is not a backreference; it is read as a character escape, so [^\1] matches any char except SOH (\x01). The tempered greedy token (?:(?!\1).)*? does what you wanted: it matches any char that is not the quote captured in Group 1, zero or more times, as few times as possible.

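    The MatchEvaluator is passed as the replacement argument. It is called once per match, and inside it you can access the whole match text with m.Value. Your original call, Replace("$&", ">", ""), could not work because the VB Replace function runs first, on the literal string "$&", which contains no >. The regex therefore receives "$&" as its replacement and puts every match back unchanged.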
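
For reference, here is a minimal, self-contained sketch of the approach above as a console program. The module name AttributeEscaper and the helper EscapeGtInAttributes are hypothetical names chosen for this illustration; the pattern is the one from the answer, and the replacement uses &gt; as in the desired result from the question. The lambda is converted to a MatchEvaluator implicitly.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AttributeEscaper
    ' Pattern from the answer: whitespace, attribute name, "=", then a quoted
    ' value whose closing quote must equal the opening quote (Group 1).
    Private ReadOnly ReAttr As New Regex(
        "\s\w+\s*=\s*(['""])(?:(?!\1).)*?\1",
        RegexOptions.Singleline)

    ' Hypothetical helper: replaces ">" with "&gt;" inside each matched
    ' attribute. Text outside the matches is left untouched.
    Public Function EscapeGtInAttributes(html As String) As String
        Return ReAttr.Replace(html, Function(m As Match) m.Value.Replace(">", "&gt;"))
    End Function

    Sub Main()
        Dim text = "<div  sometag=""abc>def"" onclick=""myfn()"" class='xyz'>"
        ' Prints: <div  sometag="abc&gt;def" onclick="myfn()" class='xyz'>
        Console.WriteLine(EscapeGtInAttributes(text))
    End Sub
End Module
```

The closing > of the tag is preserved because it lies outside every attribute match, so the evaluator never sees it.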