
How to handle special characters when converting from HTML to DocX


I have an application that converts HTML files to DocX using DocX4J. I'm having problems with special characters such as ç, á, é, í and ã. The font in my HTML files is Arial, but after conversion to DocX those characters are set in Calibri. So within a single word (e.g. Cláudio), "Cl" is in Arial, "á" is in Calibri, and "udio" is in Arial.

I read that I may need to set the font property on w:r, but I can't see how to do that for every run of the text being converted, or where it would fit in my conversion code, which is listed below along with a sample HTML file.

Any tip or suggestion about how to do this conversion and handle those special characters would be really great.

Cheers.

public WordprocessingMLPackage export(String xhtml) {

    WordprocessingMLPackage wordMLPackage = null;
    try {
        wordMLPackage = WordprocessingMLPackage.createPackage();
        XHTMLImporter importer = new XHTMLImporterImpl(wordMLPackage);
        List<Object> content = importer.convert(xhtml, null);
        wordMLPackage.getMainDocumentPart().getContent().addAll(content);
    }
    catch (Docx4JException e) {
        // ...
    }
    return wordMLPackage;
}

<html>
<head>
<meta charset="ISO-8859-1" />
<style type="text/css">
h1 {
    page-break-before: always;
}

p, h1 {
    font-family: Arial;
    font-size: 12pt;
}

p {
    line-height: 150%;
}

h1 {
    font-weight: bold;
    line-height: 130%
}
</style>
</head>
<body>
    <h1>RESUMO<br /></h1>
<p>
    <span>Um resumo para o relatório.</span><br />
</p>
</body>
</html>

Solution

  • Following the tip given by JasonPlutext, I found an example on the DocX4J forum of how to register a font mapping with the XHTMLImporter (http://www.docx4java.org/forums/docx-java-f6/docx-to-html-and-back-to-docx-t1913.html).

    Now my code is working! See the final version below.


    public WordprocessingMLPackage export(String xhtml) {

        WordprocessingMLPackage wordMLPackage = null;
        try {
            // Map the CSS font-family "Arial" to Arial for both the ASCII
            // and the high-ANSI (accented character) font slots.
            RFonts arialRFonts = Context.getWmlObjectFactory().createRFonts();
            arialRFonts.setAscii("Arial");
            arialRFonts.setHAnsi("Arial");
            XHTMLImporterImpl.addFontMapping("Arial", arialRFonts);

            wordMLPackage = WordprocessingMLPackage.createPackage();
            XHTMLImporter importer = new XHTMLImporterImpl(wordMLPackage);
            List<Object> content = importer.convert(xhtml, null);
            wordMLPackage.getMainDocumentPart().getContent().addAll(content);
        }
        catch (Docx4JException e) {
            // ...
        }
        return wordMLPackage;
    }
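
As far as I understand it, the reason the mapping needs both setAscii and setHAnsi is that Word picks the font for each character from a different slot of w:rFonts depending on the character's range: plain ASCII letters use the ascii slot, while accented Latin characters such as "á" use the hAnsi slot. If only the ascii slot is filled, the accented characters probably fall back to the document default (Calibri here). The sketch below is plain Java with no docx4j dependency and only illustrates that split. The class FontSlotDemo and its methods are hypothetical names made up for this example, and the rule "everything above U+007F is hAnsi" is a deliberate simplification of what Word really does (it ignores the eastAsia and cs slots).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docx4j: shows how a word is split
// into pieces that would take their font from different rFonts slots.
class FontSlotDemo {

    // Simplifying assumption: code points up to U+007F use the "ascii" slot,
    // anything else in this Latin-only example uses the "hAnsi" slot.
    static String slotFor(int codePoint) {
        return codePoint <= 0x7F ? "ascii" : "hAnsi";
    }

    // Groups consecutive characters that share a slot, e.g. "ascii:Cl".
    static List<String> splitByFontSlot(String text) {
        List<String> pieces = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentSlot = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String slot = slotFor(cp);
            if (currentSlot != null && !slot.equals(currentSlot)) {
                // The slot changed, so close the piece collected so far.
                pieces.add(currentSlot + ":" + current);
                current.setLength(0);
            }
            currentSlot = slot;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (currentSlot != null) {
            pieces.add(currentSlot + ":" + current);
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "Cl\u00e1udio" is "Cláudio" written with a Unicode escape.
        System.out.println(splitByFontSlot("Cl\u00e1udio"));
    }
}
```

For "Cláudio" this yields three pieces ("Cl" in the ascii slot, "á" in the hAnsi slot, "udio" in the ascii slot), which matches the Arial/Calibri/Arial pattern described in the question.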