How to validate that a string doesn't contain HTML using C#


Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields contain only plain text. I thought about looking for the < character, but that character can legitimately appear in plain text. Another option might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.


Solution

  • I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

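    // Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
    // Extension methods must be declared inside a static class.
    public static bool ContainsXHTML(this string input)
    {
        try
        {
            XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
            // Plain text yields only text nodes (or none, for an empty string);
            // any other node type (element, comment, CDATA, ...) counts as markup.
            return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
        }
        catch (XmlException)
        {
            // Input that is not well-formed XML is treated as containing markup.
            return true;
        }
    }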
    

    One problem I found was that ampersand and less-than characters in plain text cause an XmlException, so the field is wrongly reported as containing HTML. To fix this, the ampersands and less-than characters in the input string first need to be converted to their equivalent XHTML entities. I wrote another extension method to do that:

    // Requires: using System.Text.RegularExpressions;
    public static string ConvertXHTMLEntities(this string input)
    {
        // Convert all ampersands to the ampersand entity. The negative
        // lookahead leaves existing "&amp;" entities untouched.
        string output = Regex.Replace(input, "&(?!amp;)", "&amp;");

        // Convert less than to the less than entity (without messing up tags).
        // Only "<" followed by a space is handled; "a<b" is still seen as markup.
        output = output.Replace("< ", "&lt; ");
        return output;
    }
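
    The less-than replacement above only covers a "<" that is followed by a space. One possible way to widen it, sketched below, is to escape every "<" that is not followed by a character that could start a tag. The class and method names are hypothetical, and the sketch assumes that tag-like constructs always begin with an ASCII letter, "/", "!" or "?"; input such as "a<b" remains ambiguous and is left alone.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper, not part of the original answer.
    public static class LessThanEscaping
    {
        // Escapes "<" unless the next character could start an element,
        // a closing tag, a comment/CDATA/DOCTYPE, or a processing instruction.
        // A "<" at the very end of the string is escaped as well.
        public static string EscapeLooseLessThan(string input)
        {
            return Regex.Replace(input, @"<(?![A-Za-z/!?])", "&lt;");
        }
    }
    ```

    With this helper, "x <= y" and "1 <2" have their "<" escaped, while "<b>hi</b>" passes through unchanged so the XElement check can still detect the tag.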
    

    Now I can take a user-submitted string and check that it doesn't contain HTML using the following code:

    bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();
    

    I'm not sure if this is bulletproof, but I think it's good enough for my situation.
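
    For anyone who wants to try the approach, here is a minimal compilable sketch that hosts both extension methods in one static class with the using directives they need. The class name PlainTextCheck is an assumption, and this variant treats any non-text node as markup and escapes ampersands with a regular expression, so it may differ in detail from the code I actually ran.

    ```csharp
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml;
    using System.Xml.Linq;

    // Hypothetical container class; extension methods must live in a static class.
    public static class PlainTextCheck
    {
        // Escapes ampersands that are not already "&amp;", and any "<"
        // that is followed by a space.
        public static string ConvertXHTMLEntities(this string input)
        {
            string output = Regex.Replace(input, "&(?!amp;)", "&amp;");
            return output.Replace("< ", "&lt; ");
        }

        // Returns true when the input parses to anything other than text
        // nodes, or when it is not well-formed XML at all.
        public static bool ContainsXHTML(this string input)
        {
            try
            {
                XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
                return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
            }
            catch (XmlException)
            {
                return true;
            }
        }
    }
    ```

    Under these assumptions, "Fish & chips" and "1 < 2" are reported as plain text, "<b>bold</b>" is reported as containing HTML, and "a<b" is also reported as containing HTML because the parser reads "<b" as the start of a tag.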