java encoding character-encoding

Polish characters in a file written from Java


I am using Jsoup in my project. I read a docx file and convert it to HTML. I want to write the result to a file, but I have a problem: FileOutputStream does not write Polish characters. For example, instead of

Wiersz nad którym znajduje się aktualnie kursor myszy

I have

Wiersz nad kt?rym znajduje si� aktualnie kursor myszy . 

This is the method where I parse the HTML:

public String parseHTML(String html) {
    int i = 0;
    Document doc = Jsoup.parse(html);
    doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset("ISO-8859-2");
    for (Element element : doc.select("img[src]")) {
        element.attr("src", "resources/images/img" + i + ".png");
        i++;
    }
    return doc.toString();
}

and here I write it to the file:

public void saveHelpFile(byte[] document) throws IOException {
    File file = new File(
            "path/to/file");
    String s = new String(document, "ISO-8859-2");
    PrintWriter writer = new PrintWriter(file, "ISO-8859-2");
    try {
        writer.write(s);
    } finally {
        writer.close();
    }
}

Here is the method where I read the file:

public void uploadFile() throws XWPFConverterException, IOException {
        InputStream in = new FileInputStream(new File("path/to/file"));
        XWPFDocument document = new XWPFDocument(in);

        XHTMLOptions options = XHTMLOptions.create();
        XHTMLConverter.getInstance().convert(document, out, options);

        String html = out.toString();
        html = html.replaceAll("<html>",
                "<html xmlns='http://www.w3.org/1999/xhtml' " + "\n" + " xmlns:h='http://java.sun.com/jsf/html' " + "\n"
                        + " xmlns:f='http://java.sun.com/jsf/core' " + "\n" + " xmlns:p='http://primefaces.org/ui ' "
                        + "\n" + " xmlns:ui='http://java.sun.com/jsf/facelets' " + "\n"
                        + " xmlns:pe='http://primefaces.org/ui/extensions' " + "\n"
                        + " xmlns:components='http://java.sun.com/jsf/composite/components' >");

        html = parseHTML(html, extractPhoto(document));
        html = html.replaceAll("<body>", "<h:body>").replaceAll("</body>", "</h:body>");
        saveHelpFile(html.getBytes("ISO-8859-2"));
    }

Solution

  • Your String is fine and contains the correct text, but you write it to the file with the charset "ISO-8859-2". A file does not store the charset it was written with, so whatever application reads the file has to know or guess it. That is why it is generally recommended to write files in UTF-8 or UTF-16.

    No change is needed in the code that produces your String; only change the charset to UTF-8 when you write the file. This works because new String(document, "ISO-8859-2") tells Java that the bytes are ISO-8859-2 text and must be interpreted as such, so the String is built correctly. Internally, Java keeps all Strings as Unicode (UTF-16), not in the charset of the bytes they were created from.

    So you can write your String to any destination (a file in your case) in any valid charset and Java will know how to encode it: "ISO-8859-2", "UTF-8", or any other charset that supports Polish (for instance "UTF-16"). Since UTF-8 is the generally accepted de facto standard, it is the recommended choice.
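
To illustrate the point about decoding with the source charset and writing with UTF-8, here is a minimal sketch. The class name CharsetDemo and the helpers saveAsUtf8 and readAs are hypothetical and not part of the question's code, and the temporary file stands in for the real target path. It also shows what a reader sees when it guesses UTF-8 for a file that was actually written in ISO-8859-2.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names are illustrative, not taken from the question.
class CharsetDemo {

    // Charset of the bytes that the question passes to saveHelpFile.
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    /** Decodes the bytes with the charset they were encoded in, then writes the text as UTF-8. */
    static void saveAsUtf8(Path target, byte[] document, Charset sourceCharset) throws IOException {
        String text = new String(document, sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads a file back, decoding it with the charset the reader assumes. */
    static String readAs(Path source, Charset assumed) throws IOException {
        return new String(Files.readAllBytes(source), assumed);
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file independent of its own encoding:
        // \u00f3 is "o with acute", \u0119 is "e with ogonek".
        String original = "Wiersz nad kt\u00f3rym znajduje si\u0119 aktualnie kursor myszy";
        Path file = Files.createTempFile("help", ".xhtml");
        try {
            // Written as UTF-8 and read as UTF-8: the text survives.
            saveAsUtf8(file, original.getBytes(LATIN2), LATIN2);
            System.out.println(original.equals(readAs(file, StandardCharsets.UTF_8)));

            // Written as ISO-8859-2 but read as UTF-8: the Polish letters
            // become the replacement character U+FFFD.
            Files.write(file, original.getBytes(LATIN2));
            System.out.println(readAs(file, StandardCharsets.UTF_8).contains("\uFFFD"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In the question's saveHelpFile, the equivalent change would be to pass "UTF-8" to the PrintWriter while keeping "ISO-8859-2" in the new String(...) call, assuming the incoming bytes really are ISO-8859-2 as in uploadFile.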