Search code examples
javadomxpathinputstreamtransformer-model

Java transformer w3c.dom.document to inputstream


My scenario is this:

I have a HTML which I loaded into a w3c.dom.Document, after loading it as a doc, I parsed through its nodes and made a few changes in their values, but now I need to transform this document into a String, or preferably into a InputStream directly.

And I managed to do so, however, to the ends I need this HTML it must keep some properties of the initial file, for instance (and this is the one thing I'm struggling a lot trying to solve), all tags must be closed.

Say, I have a link tag on the header, <link .... /> I NEED the dash (/) at the end. However after the transformer transform my doc into a outputStream (which then I proceed to send to an inputStream) all the '/' before the > disappear. All my tags, which ended in /> are changed into simple >.

The reason I need this structure is that one of the libraries I'm using (and I'm afraid I can't go looking for another one, specially not at this point) require all tags to be closed, if not it throws exceptions everywhere and my program crashes....

Does anyone have any good ideas or solutions for me? This is my first contact with the Transform class, so I might be missing something that could help me.

Thank you all so very much,

Warm regards

Some bit of the code to explain the scenario a little bit

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
org.w3c.dom.Document doc = docBuilder.parse(his); // his = the HTML inputStream

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//*[@id='pessoaNome']";
org.w3c.dom.Element pessoaNome = null;

try 
{
    pessoaNome = (org.w3c.dom.Element) (Node) xPath.compile(expression).evaluate(doc, XPathConstants.NODE);
} 
catch (Exception e) 
{
    e.printStackTrace();
}

pessoaNome.setTextContext("The new values for the node");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);

Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "HTML");
transformer.transform(xmlSource, outputTarget);
InputStream is = new ByteArrayInputStream(outputStream.toByteArray()); // At this point outputStream is already all messed up, not just the '/'. but this is the only thing causing me problems

as @Lee pointed out, I changed it to use Jsoup. Code got a lot cleaner, just had to set up the outputSettings for it to work like a charm. Code below

org.jsoup.nodes.Document doc = Jsoup.parse(new File(HTML), "UTF-8");

org.jsoup.nodes.Element pessoaNome = doc.getElementById("pessoaNome");

pessoaNome.html("My new html in here");

OutputSettings oSettings = new OutputSettings();
oSettings.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.outputSettings(oSettings);
InputStream is = new ByteArrayInputStream(doc.outerHtml().getBytes());

Solution

  • Have a look at jTidy which cleans HTML. There is also jsoup which is newer as supposedly does the same things only better.