Tags: java, xpath, xml-parsing, sax, tag-soup

Using a SAX parser when I need a DocumentBuilder


XMLBeam is a nice XML-to-POJO unmarshaller (via XPath), but it only lets you configure a DocumentBuilder or DocumentBuilderFactory.

TagSoup is a nice SAX parser that lets you parse nasty HTML documents as though they were XML.

I would like to use TagSoup as the XML parser for XMLBeam, so that I can unmarshal nasty HTML to POJOs using XPath.

Is there a way to convert or wrap a SAX parser, so that I can use it as a DocumentBuilder or DocumentBuilderFactory?


Solution

  • You can wrap the SAX parser in a DocumentBuilder subclass. XMLBeam only uses the parse(InputSource) method of DocumentBuilder, so the other abstract methods can be left as stubs:

    import org.ccil.cowan.tagsoup.Parser;
    import org.w3c.dom.DOMImplementation;
    import org.w3c.dom.Document;
    import org.xml.sax.*;
    
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;
    import java.io.IOException;
    
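    // DocumentBuilder backed by TagSoup. Only parse(InputSource) does real work;
    // the remaining abstract methods are stubs.
    public class MyDocumentBuilder extends DocumentBuilder {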
    
        @Override
        public Document parse(InputSource inputSource) throws SAXException, IOException {
    
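            // TagSoup's Parser is a SAX XMLReader that tolerates malformed HTML.
            XMLReader xmlReader = new Parser();
            // Turn off namespace processing so XPath expressions can use plain element names.
            xmlReader.setFeature(Parser.namespacesFeature, false);
            xmlReader.setFeature(Parser.namespacePrefixesFeature, false);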
    
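            try {
                // An identity transform copies the SAX event stream into a DOM tree.
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                DOMResult domResult = new DOMResult();
                transformer.transform(new SAXSource(xmlReader, inputSource), domResult);
                return (Document) domResult.getNode();
            }
            catch (Exception exp) {
                // Pass the original exception along as the cause so the failure can be diagnosed.
                throw new RuntimeException("Error parsing with TagSoup", exp);
            }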
        }
    
        @Override
        public void setErrorHandler(ErrorHandler errorHandler) {
    
        }
    
        @Override
        public Document newDocument() {
            return null;
        }
    
        @Override
        public void setEntityResolver(EntityResolver entityResolver) {
    
        }
    
        @Override
        public boolean isValidating() {
            return false;
        }
    
        @Override
        public DOMImplementation getDOMImplementation() {
            return null;
        }
    
        @Override
        public boolean isNamespaceAware() {
            return false;
        }
    }
    

    Then, elsewhere you can tell XMLBeam to use your DocumentBuilder:

        XMLFactoriesConfig xmlFactoriesConfig = new DefaultXMLFactoriesConfig(){
            @Override
            public DocumentBuilder createDocumentBuilder() {
                return new MyDocumentBuilder();
            }
        };
    
        XBProjector xbProjector = new XBProjector(xmlFactoriesConfig);
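
The core trick in parse(InputSource) above is the SAX-to-DOM bridge: an identity Transformer reads events from any XMLReader and writes them into a DOMResult. To try that step in isolation, here is a minimal sketch that needs only the JDK. It assumes the JDK's built-in SAX reader as a stand-in for TagSoup's Parser, so it handles well-formed input only. The class and method names (SaxDomBridge, toDom, parse, xpath) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class; not part of XMLBeam or TagSoup.
final class SaxDomBridge {

    // Copies the SAX events produced by the reader into a new DOM Document
    // by running them through an identity transform.
    static Document toDom(XMLReader reader, InputSource input) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            identity.transform(new SAXSource(reader, input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    // Assumption: the JDK's default SAX reader stands in for TagSoup here,
    // so the input must be well-formed XML.
    static Document parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            return toDom(reader, new InputSource(new StringReader(xml)));
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX reader", e);
        }
    }

    // Evaluates an XPath expression against the document and returns its string value.
    static String xpath(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Hello</title></head><body/></html>");
        System.out.println(xpath(doc, "/html/head/title")); // prints "Hello"
    }
}
```

Passing a TagSoup Parser instance to toDom instead of the JDK reader should give the same kind of Document for messy HTML, which is what MyDocumentBuilder does.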