Search code examples
javaapache-poidoc

Apache POI - converting *.doc to *.html with images


There is a DOC file that contains some image. How to convert it to HTML with image?

I tried to use this example: Convert Word doc to HTML programmatically in Java

public class Converter {
    ...

    private File docFile, htmlFile;

    try {
        FileInputStream fos = new FileInputStream(docFile.getAbsolutePath()); 
        HWPFDocument doc = new HWPFDocument(fos);       
        Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc) ;
        wordToHtmlConverter.processDocument(doc);

        StringWriter stringWriter = new StringWriter();

        Transformer transformer = TransformerFactory.newInstance().newTransformer();        
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                    new DOMSource(wordToHtmlConverter.getDocument()),
                    new StreamResult(stringWriter)
        );

        String html = stringWriter.toString();

        try {
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8")
            );     
            out.write(html);
            out.close();
       } catch (IOException e) {
           e.printStackTrace();
       }

       JEditorPane jEditorPane = new JEditorPane();
       jEditorPane.setContentType("text/html");
       jEditorPane.setEditable(false);
       jEditorPane.setPage(htmlFile.toURI().toURL());

       JScrollPane jScrollPane = new JScrollPane(jEditorPane);

       JFrame jFrame = new JFrame("display html file");
       jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
       jFrame.getContentPane().add(jScrollPane);
       jFrame.setSize(512, 342);
       jFrame.setVisible(true);

    } catch(Exception e) {
        e.printStackTrace();
    }
    ...
}

But the image is lost.

The documentation for the WordToHtmlConverter class says the following:

...this implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

How to convert DOC to HTML with images?


Solution

  • Your best bet in this case is to use Apache Tika, and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want the HTML for your case). Along with that, it'll put in placeholders for embedded resources, img tags for embedded images, and provide you with a way to get at the contents of the embedded resources and images.

    There's a very good example of doing this included in Alfresco, HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. The code there includes a custom ContentHandler which allows editing of the img tags, to re-write the src attributes, you may or may not need that depending on where you're going to write out the images to.