Tags: .net, regex, html-to-text

How to convert HTML to plain text while retaining tabs and other valid plain-text layout


With regard to the solution referenced below, how can we adapt it to retain tabs and other valid plain-text layout?

Referenced solution:

    public static string StripHTML(string HTMLText, bool decode = true)
    {
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        var stripped = reg.Replace(HTMLText, "");
        return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
    }

Solution

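  • I'm not sure what you mean: it already preserves tabs and newlines. The regex removes only the tags, so whitespace characters in the source pass through unchanged.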

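    // LINQPad script; needs System.Text.RegularExpressions and System.Web.
    void Main()
    {
        var html = "<html>\n\t<body>\n\t\tBody text!\n\t</body>\n</html>";

        // Dump() is LINQPad's output helper; use Console.WriteLine elsewhere.
        StripHTML(html).Dump(); // Prints "\n\t\n\t\tBody text!\n\t\n"
    }

    public static string StripHTML(string HTMLText, bool decode = true)
    {
        // Remove every tag; whitespace between tags is left untouched.
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        var stripped = reg.Replace(HTMLText, "");
        return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
    }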
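
One possible reading of the question is that the layout to keep is expressed by tags such as <br> and </p> rather than by literal tab and newline characters, in which case the regex above discards it. The following is only a sketch under that assumption, not part of the original answer: the class HtmlText, the method StripHtmlKeepLayout, and the list of block-level tags are hypothetical and chosen for illustration. It also swaps HttpUtility.HtmlDecode for System.Net.WebUtility.HtmlDecode so that the example runs without a reference to System.Web.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // <br>, <br/> and <br /> each become a newline.
    private static readonly Regex LineBreak =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase);

    // Closing tags of a few common block elements also end a line.
    // This list is an assumption; extend it to suit the input.
    private static readonly Regex BlockEnd =
        new Regex(@"</(p|div|li|tr|h[1-6])\s*>", RegexOptions.IgnoreCase);

    // Same pattern as the original StripHTML: any remaining tag.
    private static readonly Regex AnyTag = new Regex("<[^>]+>");

    public static string StripHtmlKeepLayout(string html, bool decode = true)
    {
        // Turn layout-bearing tags into newlines before stripping the rest.
        var text = LineBreak.Replace(html, "\n");
        text = BlockEnd.Replace(text, "\n");
        text = AnyTag.Replace(text, "");
        return decode ? WebUtility.HtmlDecode(text) : text;
    }
}

public static class Program
{
    public static void Main()
    {
        var html = "<p>Name:\tAda</p><p>Role:\tEngineer<br>Team:\tCore</p>";
        // Tabs in the source survive; </p> and <br> become line breaks.
        Console.WriteLine(HtmlText.StripHtmlKeepLayout(html));
    }
}
```

Literal tabs and newlines still pass through untouched, exactly as in the answer's StripHTML. For malformed or deeply nested markup a regex is fragile, and an HTML parser would likely be the more robust choice.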