Tags: java, html-parsing, well-formed, non-well-formed

How can I determine whether an HTML document is well formed in Java?


Hey guys, I need to determine whether a given HTML document is well formed.
I just need a simple implementation using only Java core API classes, i.e. no third-party libraries such as JTidy. Thanks.

What I actually need is an algorithm that scans a list of tags. When it finds an open tag, the next tag must be either its corresponding close tag or another open tag, and the same rule then applies to that tag in turn. The close tags of the earlier open tags must follow in reverse order. I've already written a method that converts a tag to its close tag. If the list conforms to this order, the method should return true; otherwise it should return false.

Here is the skeleton code of what I've started working on. It's not too neat, but it should give you guys a basic idea of what I'm trying to do.

public boolean validateHtml(){

    ArrayList<String> tags = fetchTags();
    //fetchTags returns this [<html>, <head>, <title>, </title>, </head>, <body>, <h1>, </h1>, </body>, </html>]

    // A second ArrayList holds the tags whose corresponding close tags I haven't found yet
    ArrayList<String> unclosedTags = new ArrayList<String>();

    String temp;

    for (int i = 0; i < tags.size(); i++) {

        temp = tags.get(i);

        if(!tags.get(i+1).equals(TagOperations.convertToCloseTag(tags.get(i)))){
            unclosedTags.add(tags.get(i));
            if(){

            }

        }else{
            return true;//well formed html
        }
    }

    return true;
}

Solution

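  • Yeah, string manipulation can seem like a pickle sometimes. You need to do something like this.

    First copy the HTML into a char array, then collect the tags:

    // Needs java.util.ArrayList and java.util.List.
    // html is assumed to hold the whole document as a String.
    char[] array = html.toCharArray();
    boolean tag = false;
    StringBuilder str = new StringBuilder();
    List<String> htmlTags = new ArrayList<>();

    for (int i = 0; i < array.length; i++) {
      // Check for the start of a tag
      if (array[i] == '<') {
        tag = true;
      }

      // If the current char is part of a tag, copy it
      if (tag) {
        str.append(array[i]);
      }

      // When a tag ends, add the tag to your tag list
      if (tag && array[i] == '>') {
        htmlTags.add(str.toString());
        str.setLength(0);
        tag = false;
      }
    }

    Something like this should get you started: you should end up with a list of tags. Treat it as a rough sketch rather than finished code. It does not handle comments, a doctype, or a '>' inside an attribute value.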
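
    Once you have the tag list, the check described in the question (close tags arriving in the reverse order of their open tags) can be sketched with a stack. This is only a minimal sketch under some assumptions: tags carry no attributes, there are no void elements such as <br>, and convertToCloseTag here is a hypothetical stand-in for the asker's TagOperations.convertToCloseTag, whose real code isn't shown. TagChecker and isWellFormed are made-up names too.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class TagChecker {

    // Hypothetical helper: "<h1>" -> "</h1>". Stands in for the asker's
    // TagOperations.convertToCloseTag, which is not shown in the question.
    public static String convertToCloseTag(String openTag) {
        return "</" + openTag.substring(1);
    }

    // Returns true when every open tag is closed, and close tags appear
    // in the reverse order of their open tags.
    public static boolean isWellFormed(List<String> tags) {
        Deque<String> unclosed = new ArrayDeque<>();
        for (String tag : tags) {
            if (tag.startsWith("</")) {
                // A close tag must match the most recently opened tag.
                if (unclosed.isEmpty()
                        || !tag.equals(convertToCloseTag(unclosed.pop()))) {
                    return false;
                }
            } else {
                // Assumption: anything else is an open tag.
                unclosed.push(tag);
            }
        }
        // Anything still on the stack was never closed.
        return unclosed.isEmpty();
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("<html>", "<head>", "<title>", "</title>",
                "</head>", "<body>", "<h1>", "</h1>", "</body>", "</html>");
        List<String> bad = Arrays.asList("<html>", "<body>", "</html>", "</body>");
        System.out.println(isWellFormed(good)); // true
        System.out.println(isWellFormed(bad));  // false
    }
}
```

    The stack does the job of the unclosedTags list in the question: push on an open tag, pop and compare on a close tag, and require an empty stack at the end.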