In my RSS feed reader I need to remove any existing script blocks, because I have been told they confuse XmlReader.
For that I'm using this piece of code that I found on the web:
allXml = Regex.Replace(allXml, "(.*)<script type='text/javascript'>.+?</script>(.*)", "$1$2");
But this is too slow. Is there a faster way to do this? I already tried calling Match first, but that is equally slow:
Match rgx = Regex.Match(allXml, "(.*)<script type='text/javascript'>.+?</script>(.*)");
if (rgx.Success)
allXml = Regex.Replace(allXml,"(.*)<script type='text/javascript'>.+?</script>(.*)","$1$2");
The first (.*) grabs the whole line at once (since * is a greedy quantifier), and then the engine starts backtracking to accommodate all the subsequent patterns. If your string is one very long line, several megabytes long, this can be a problem: the engine has to perform a lot of steps before it finds the appropriate string chunks for each capturing group defined in the pattern.
If you want a quick and dirty regex fix, discard both (.*) groups and just use
string res = Regex.Replace(allXml, "(?s)<script type='text/javascript'>.*?</script>", "");
to remove the <script>...</script> substrings. Note that (?s) is the inline equivalent of the RegexOptions.Singleline (DOTALL) modifier, so that . can match newline characters, too.
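As a minimal sketch of the same fix in context: the snippet below uses the pattern above, but passes RegexOptions.Singleline as an option instead of the inline (?s) and keeps the Regex in a static field so it is built only once. The ScriptStripper class, the Strip method, and the sample feed string are made up for illustration, and RegexOptions.Compiled is an optional assumption that may or may not help in your case.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScriptStripper
{
    // Built once and reused for every feed. RegexOptions.Singleline is the
    // option form of the inline (?s) modifier: it lets . match newlines.
    // RegexOptions.Compiled is optional (assumption: many feeds are processed).
    private static readonly Regex ScriptBlock = new Regex(
        "<script type='text/javascript'>.*?</script>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Hypothetical helper: returns the input with every matching script block removed.
    public static string Strip(string allXml)
    {
        return ScriptBlock.Replace(allXml, "");
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample with a script block that spans several lines.
        string allXml =
            "<item><title>Hi</title>" +
            "<script type='text/javascript'>\nalert(1);\n</script>" +
            "<link>x</link></item>";

        // Prints: <item><title>Hi</title><link>x</link></item>
        Console.WriteLine(ScriptStripper.Strip(allXml));
    }
}
```

There is no need to call Match first: Replace returns the input unchanged when nothing matches. Keep in mind that this pattern only matches the exact opening tag <script type='text/javascript'>, so tags with double quotes or extra attributes are left in place.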
However, the best way is to use an HTML parser, like HtmlAgilityPack.