Tags: regex, rss, feed, xmlreader, rss-reader

Regex - Replace is too slow


In my RSS feed reading system I need to remove any existing script blocks, because I have read that they can confuse XmlReader.

For that I'm using this piece of code that I found on the web:

allXml = Regex.Replace(allXml, "(.*)<script type='text/javascript'>.+?</script>(.*)", "$1$2");

But this is too slow. Is there any way to make it faster? I already tried running Match first, but that is equally slow:

Match rgx = Regex.Match(allXml, "(.*)<script type='text/javascript'>.+?</script>(.*)");

if (rgx.Success)
    allXml = Regex.Replace(allXml,"(.*)<script type='text/javascript'>.+?</script>(.*)","$1$2");

Solution

  • The first (.*) grabs the whole line at once (since * is a greedy quantifier), and then starts backtracking trying to accommodate all the subsequent patterns. If your string is a very long line, several megabytes long, it might be problematic for the engine, as it will have to perform a lot of steps before it finds the appropriate string chunks for each capturing group defined in the pattern.

    If you want a quick-and-dirty regex fix, drop the (.*) groups and just use

    string res = Regex.Replace(allXml, "(?s)<script type='text/javascript'>.*?</script>", "");
    

    to remove the <script>...</script> substrings. Note that (?s) is the inline equivalent of the RegexOptions.Singleline modifier (DOTALL in other regex flavors), so that . can match newline characters, too.

    However, the best way is to use an HTML parser, like HtmlAgilityPack.
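
If you stay with the regex route described above, a minimal sketch of a reusable helper might look like the following. The names ScriptStripper and RemoveScripts are made up for illustration. The pattern is also loosened to accept any attributes on the opening tag and any letter case, which is an assumption beyond the original pattern (it matched only type='text/javascript'). Building the Regex once and reusing it avoids re-parsing the pattern on every call.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScriptStripper
{
    // Hypothetical helper, not part of any library.
    // Singleline: '.' also matches newlines (same effect as inline (?s)).
    // IgnoreCase: also accepts <SCRIPT> ... </SCRIPT>.
    // [^>]* : tolerates arbitrary attributes on the opening tag (assumption).
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the input with every <script>...</script> block removed.
    public static string RemoveScripts(string xml)
    {
        return ScriptBlock.Replace(xml, string.Empty);
    }
}

static class Program
{
    static void Main()
    {
        string allXml =
            "<item><title>A</title>" +
            "<script type='text/javascript'>\nalert(1);\n</script>" +
            "<link>x</link></item>";

        // Prints: <item><title>A</title><link>x</link></item>
        Console.WriteLine(ScriptStripper.RemoveScripts(allXml));
    }
}
```

There is no leading or trailing (.*) here, so the engine only does work where a literal <script actually occurs. This is still a text-level fix: script-like text inside comments or CDATA sections would also be removed, which is one reason a real parser is the safer choice.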