c# regex rss syndicationfeed

SyndicationFeed - item summary (RSS description) - extract only text from it


I’m using the SyndicationFeed class to consume some RSS feeds for articles. How can I get only the text from an item's Summary field, without the HTML tags? Sometimes (not always) it contains HTML tags such as div, img, h, and p, for example fragments like </a></div> or <img src='http…

I want to get rid of all tags. Also, I'm not sure whether Summary contains the full description from the RSS feed.

Should I use a regular expression for this, or is there a better method?

XmlReader reader = XmlReader.Create(response.GetResponseStream());

SyndicationFeed feed = SyndicationFeed.Load(reader);

foreach (SyndicationItem item in feed.Items)
{
    // Summary is a TextSyndicationContent (and may be null), so read its Text property.
    // The text still contains tags, not only the article text.
    string description = item.Summary?.Text;
}

Solution

  • Yeah, I suppose regexes are the easiest built-in way to achieve this:

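    // Requires: using System.Text.RegularExpressions; using System.Net;
    // Get rid of the tags ([^>] also matches newlines, so tags split across lines are removed too)
    description = Regex.Replace(description, @"<[^>]+>", String.Empty);
    
    // Then decode the HTML entities
    description = WebUtility.HtmlDecode(description);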
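
For reference, the two steps above can be wrapped in a small helper. This is only a minimal sketch: the class name `HtmlText` and the method name `StripHtml` are made up for illustration, and it assumes the summary is ordinary HTML. A regex is not a real HTML parser, so things like the contents of script or style elements would be left in the output. The sketch also replaces each tag with a space (instead of an empty string) and then collapses whitespace, so that words separated only by markup do not run together.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SyndicationFeed or any library.
public static class HtmlText
{
    // Anything that looks like a tag; [^>] also matches newlines.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Runs of whitespace left behind after the tags are removed.
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words stay separated.
        string text = TagPattern.Replace(html, " ");

        // Decode entities such as &amp; after the tags are gone.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace runs and trim the ends.
        return WhitespacePattern.Replace(text, " ").Trim();
    }
}
```

In the loop from the question this would be called as `HtmlText.StripHtml(item.Summary?.Text)`. One trade-off of inserting a space per tag: inline markup inside a word (for example `<b>H</b>ello`) ends up split as `H ello`; use `String.Empty` as the replacement if that matters more than keeping block-level text apart.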