Search code examples
c#.nethtmlregexstrip

Replacing <p>, <div> tags within <td> tags?


I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs then <p> and <div> tags with double carriage-returns. However, when stripping code like this:

<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>

It (obviously) produces

First Text

Some Text

We'd like to have the <p> replaced with nothing in this case, so it produces:

First Text (tab) Some Text

However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.

Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.

Current code: (C#)

  // insert tabs in places of <TD> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);  

  // insert line paragraphs (double line breaks) in place
  // of <P>, <DIV> and <TR> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);

(there's more code to the stripper; this is the relevant part)

Any ideas on how to do this without completely rewriting the entire stripper?

EDIT: I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.

Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on a regex in Regular Expressions Cookbook. This, replacing line break tags with /r, and dealing with multiple tabs is the brunt of the custom stripping code.


Solution

  • Found the answer:

      // remove p/div/tr inside of td's
      result = System.Text.RegularExpressions.Regex.Replace(result, @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>.*?</td\b(?:[^>""']|""[^""]*""|'[^']*')*>", new MatchEvaluator(RemoveTagsWithinTD));
    

    This code calls this separate method for each match:

      //a separate method
      private static string RemoveTagsWithinTD(Match matchResult) {
          return Regex.Replace(matchResult.Value, @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "");
        }
    

    This code was (again) based on another recipe from the Regular Expressions Cookbook (which was sitting in front of me the whole time, d'oh!). It's really a great book.