c# html regex streamreader

Need the text inside the body tag, without any tags


Hi, I have a resume in HTML format. I am reading the file with a StreamReader and removing the tags using the method below.

    using (StreamReader sr = new StreamReader("\\Myfile.html"))
    {
        // Read the whole file, then strip anything that looks like a tag.
        string html = sr.ReadToEnd();
        string text = Regex.Replace(html, "<.*?>", String.Empty);
    }

It's working great.

However, my requirement is to get only the data inside the body tag: without the body tag itself, and without any of the tags inside it.


Solution

  • Don't use regex to parse HTML or XML; use an HTML/XML parser instead. These questions explain why regex is a poor fit:

    RegEx match open tags except XHTML self-contained tags

    Can you provide some examples of why it is hard to parse XML and HTML with a regex?

    You can load the HTML into a document using the HTML Agility Pack.

    Here is a small example of how to do it:

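    // Requires the HtmlAgilityPack NuGet package (using HtmlAgilityPack;).
    public string GetBodyText(string htmlFile)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(htmlFile);

        // "//body" matches <body> at any depth; a bare "body" only matches
        // direct children of the document node and usually finds nothing.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

        // InnerText is the text of all descendants with the markup removed.
        return body == null ? string.Empty : body.InnerText;
    }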
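
Since the question already has the whole file in a string, here is a minimal sketch of the same idea starting from a string instead of a file path. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `BodyTextExample`, the method name `ExtractBodyText`, and the sample HTML are made up for illustration. Removing `script` and `style` elements is an extra step the question did not ask for, added on the assumption that their contents are not wanted in the resume text.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class BodyTextExample
{
    // Hypothetical helper: returns the text inside <body>, or "" if there is no body.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string rather than from a file

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return string.Empty;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection noise = body.SelectNodes(".//script|.//style");
        if (noise != null)
        {
            foreach (HtmlNode node in noise)
            {
                node.Remove(); // detach the element from its parent
            }
        }

        // InnerText joins the text of all descendants; DeEntitize decodes
        // entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }

    public static void Main()
    {
        // Sample input, invented for this example.
        string html = "<html><head><title>CV</title></head>" +
                      "<body><h1>Jane Doe</h1><p>C# &amp; SQL developer</p></body></html>";
        Console.WriteLine(ExtractBodyText(html));
    }
}
```

Note that `InnerText` joins adjacent text nodes without inserting separators, so text from neighbouring elements can run together. If the layout matters, one option is to walk the text nodes and join them with spaces or line breaks yourself.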