java tomcat encoding normalize

Encoding artefacts when normalizing a String in Java


My website allows its users to upload files to a remote server. To avoid trouble with filenames on the server, I want to apply a few simple rules when naming the uploaded files:

  1. replace all accented letters (à, é, è etc.) with their unaccented equivalents (i.e. a, e, e in our example)
  2. replace all special characters with an underscore
  3. lowercase the whole thing

My code looks like this:

protected String serverFilename(String localFilename) {
    if (localFilename == null || localFilename.length() == 0) {
        throw new IllegalArgumentException("Invalid filename for upload (localFilename=" + localFilename + ")");
    }

    String result = Normalizer.normalize(localFilename, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "").replaceAll("[^a-zA-Z0-9.]", "_").toLowerCase();
    LOG.debug("filename " + localFilename + " returns: " + result);
    return result;
}

This unit test runs just fine:

assertEquals("capture_d_ecran_2012_08_02_a_12.45.29.png", uploader.serverFilename("Capture d’écran 2012-08-02 à 12.45.29.png"));

But in real use, i.e. in Tomcat 6 running locally on a Mac server, when I upload a file with a similar filename, the resulting filename is 'capture_d_ao__cran_2012_07_10____10.22.01.png':

filename Capture d’écran 2012-07-10 à 10.22.01.png returns: capture_d_ao__cran_2012_07_10____10.22.01.png

I guess there's some sort of mis-encoding somewhere, but I have no idea where. Any tips on how I can fix this?

UPDATE: both the Java source file and the HTML responsible for uploading the file are UTF-8 encoded.


Solution

  • My guess is that the Java source files are saved in a different encoding than the one used for the HTTP request. The default on Macs tends to be MacRoman, but you should use UTF-8 everywhere.

    Copypasta'd at OP's request.
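
One way to test the mismatch guess outside Tomcat is to simulate it: take the filename, encode it as UTF-8 (what the browser is presumed to send), decode those bytes with MacRoman, and run the result through the same normalization chain. The sketch below is only an illustration under assumptions: the class name EncodingMismatchDemo and the helper decodeUtf8BytesAs are hypothetical, the charset name "x-MacRoman" is the one used by full JDK builds and may be missing on stripped-down runtimes, and where exactly the wrong decoding happens in the real upload path is not known from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; only the normalization chain comes from the question.
final class EncodingMismatchDemo {

    // Same chain as in the question, with an explicit Locale.ROOT so that
    // lowercasing does not depend on the JVM's default locale.
    static String serverFilename(String localFilename) {
        return Normalizer.normalize(localFilename, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .replaceAll("[^a-zA-Z0-9.]", "_")
                .toLowerCase(Locale.ROOT);
    }

    // Simulates a receiver that decodes UTF-8 bytes with some other charset.
    static String decodeUtf8BytesAs(String original, Charset wrongCharset) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source pure ASCII, so the encoding the
        // file is saved in cannot influence the experiment.
        String name = "Capture d\u2019\u00e9cran 2012-07-10 \u00e0 10.22.01.png";

        // Assumption: the extended charset "x-MacRoman" is available.
        Charset macRoman = Charset.forName("x-MacRoman");

        // The default charset is what new String(bytes) and getBytes() use
        // when no charset is passed explicitly.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8:    " + serverFilename(name));
        System.out.println("decoded as MacRoman: "
                + serverFilename(decodeUtf8BytesAs(name, macRoman)));
    }
}
```

If the last line printed matches the garbled name from the log in the question, that supports the idea that UTF-8 bytes are being read as MacRoman somewhere between the browser and serverFilename. In that case the things to check are the JVM default charset (printed first) and any place where bytes are turned into a String without naming UTF-8 explicitly.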