java jakarta-ee encoding multipart

Incorrect encoding from uploaded text file


I'm working on a Java EE application that uploads text files to the server in order to process their content. The users' text sources can vary greatly, especially in their encoding.

I'd like to convert everything to UTF-8 (persistence is coming), but first I need to read it correctly.

I'm using InputStreamReader's getEncoding() method:

public void doThings(HttpServletRequest request) throws IOException, ServletException {
    Part file = request.getPart("formfile");
    InputStreamReader isr = new InputStreamReader(file.getInputStream());

    // BUT THIS ALWAYS prints "UTF8", whatever the text file's encoding is:
    System.out.println(isr.getEncoding());
}

I actually use an InputStream because the app later uses the Scanner class and delimiters to chop the data up, but if something else is the way to go, I'm not bound to it in any way.

Thanks for any pointers


Solution

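  • You would need to pass the charset explicitly: new InputStreamReader(file.getInputStream(), charsetOfFile). Otherwise the reader uses the platform default charset of the JVM the application runs in, evidently UTF-8 here. getEncoding() only reports the charset the reader was constructed with; it does not inspect the file's content.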

    There is no reliable way to extract the encoding/charset. The headers are not decisive, and part.getContentType() is more of a guess as well. If it does carry a charset parameter, that can serve as a first indicator.

    Replace the charset ISO-8859-1 (Latin-1) with Windows-1252 (Windows Latin-1), as all browsers interpret ISO-8859-1 as Windows-1252.

    Windows-1252 is also a good default (as ISO-8859-1 has traditionally been the HTTP default for text content).

    If the file content validates as well-formed UTF-8 (including its multi-byte sequences), take that. This only needs a UTF-8 validation pass over the bytes.

    Charset detection is implemented by some libraries. I wrote my own, incomplete detection based on language detection (using frequency lists).

    For charset detection, read the file as binary data (bytes), without an InputStreamReader.
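
The heuristics above could be combined roughly as follows. This is only a sketch under assumptions, not a reliable detector: the class UploadDecoder and its methods are hypothetical names, the order of the checks (declared charset first, then UTF-8 validation, then Windows-1252) is one possible reading of the advice, and it works on raw bytes so it does not depend on the servlet API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Servlet API.
final class UploadDecoder {

    // Assumed fallback when nothing better is known.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    /** Extracts the charset parameter from a Content-Type value, or null if absent. */
    static String charsetParam(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String piece : contentType.split(";")) {
            String p = piece.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Returns true if the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses a charset: declared charset, else UTF-8 if valid, else Windows-1252. */
    static Charset guessCharset(byte[] data, String contentType) {
        String declared = charsetParam(contentType);
        if (declared != null) {
            try {
                Charset cs = Charset.forName(declared);
                // Treat ISO-8859-1 as Windows-1252, as browsers do.
                return cs.equals(StandardCharsets.ISO_8859_1) ? FALLBACK : cs;
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to content checks.
            }
        }
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : FALLBACK;
    }

    /** Decodes the raw bytes using the guessed charset. */
    static String decode(byte[] data, String contentType) {
        return new String(data, guessCharset(data, contentType));
    }
}
```

In the servlet, the bytes could be obtained with file.getInputStream().readAllBytes() (Java 9 or later) and passed to UploadDecoder.decode(bytes, file.getContentType()). The resulting String can then be handed to a Scanner directly, so no InputStreamReader is needed.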