
Java XML parser exception: The end-tag for element type "col" must end with a '>' delimiter


I want to parse an HTML string into an org.w3c.dom.Document, and I use this method:

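import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Parses the input as XML; prints the stack trace and returns null if parsing fails.
public static Document stringToDocument(String input) {
    try {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource is = new InputSource();
        is.setCharacterStream(new StringReader(input));
        return db.parse(is);
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}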

This works fine on most HTML, except when the HTML string contains "colgroup" and "col" tags, as in the following:

<html dir="rtl"><head><meta charset="utf-8"/></head>
<body>
<table>
<colgroup>
<col width="29">
<col style="width:54pt" span="4" width="72">
<col width="4">
</colgroup>
<tbody>
<tr>
<td>test</td>
<td>105</td>
<td>110</td>
</tr>
<tr>
<td>456</td>
<td>456</td>
<td>786</td>
</tr>
</tbody>
</table>
</body>
</html>

The exception thrown by the method is:

org.xml.sax.SAXParseException; lineNumber: 8; columnNumber: 6; The end-tag for element type "col" must end with a '>' delimiter.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

According to w3schools, the col tag syntax is correct, and I don't know how to solve this problem.


Solution

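  • The problem is that HTML is not XML. DocumentBuilder is an XML parser, and in XML every element must be closed, so <col width="29"> without a closing tag is an error even though it is valid HTML. The parser is still inside the unclosed col element when it reaches </colgroup>, which is why it complains about the end-tag for "col". See http://courses.cs.vt.edu/~cs1204/XML/htmlVxml.html, http://www.xmlobjective.com/what-is-the-difference-between-xml-and-html/, or https://webkit.org/blog/68/understanding-html-xml-and-xhtml/, or use your favorite search engine and search for: xml vs html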

    By the way, if you really want to parse HTML, you could use a third-party library such as https://jsoup.org/ or http://htmlcleaner.sourceforge.net/
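
To illustrate the HTML-versus-XML point, here is a minimal sketch that uses only the JDK's own XML parser. It feeds the parser a shortened version of the table twice: once with HTML-style open col tags, and once with each col self-closed as XML requires. The class name WellFormedColDemo, the helper parseOrNull, and the two sample strings are made up for this example and are not part of the original question. It only applies if you control the markup and can make it well-formed; for arbitrary real-world HTML, an HTML parser such as the libraries mentioned above is the more realistic option.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: compares HTML-style and XML-style <col> markup.
class WellFormedColDemo {

    // HTML-style: <col> is never closed. Valid HTML, but not well-formed XML.
    static final String HTML_STYLE =
        "<table><colgroup><col width=\"29\"><col width=\"4\"></colgroup></table>";

    // XML-style: every <col> is self-closed, so the markup is well-formed XML.
    static final String XML_STYLE =
        "<table><colgroup><col width=\"29\"/><col width=\"4\"/></colgroup></table>";

    // Returns the parsed document, or null when the input is not well-formed XML.
    static Document parseOrNull(String input) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(input)));
        } catch (SAXException e) {
            // SAXParseException (a SAXException) signals malformed input.
            return null;
        } catch (Exception e) {
            // Parser configuration or I/O problems are unrelated to the markup itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("HTML-style parsed: " + (parseOrNull(HTML_STYLE) != null));

        Document doc = parseOrNull(XML_STYLE);
        System.out.println("XML-style parsed: " + (doc != null));
        if (doc != null) {
            System.out.println("col elements found: "
                + doc.getElementsByTagName("col").getLength());
        }
    }
}
```

The default error handler may also print a "[Fatal Error]" line to standard error for the HTML-style input; that is the same SAXParseException shown in the question, surfacing before parseOrNull returns null.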