java, regex, filenames, sanitization, replaceall

How do I replace illegal characters in a filename?


I am trying to create a zip with folders inside it and I have to sanitize the folder names against any illegal characters. I did some googling around and found this method from http://www.rgagnon.com/javadetails/java-0662.html:

public static String sanitizeFilename(String name) {
    return name.replaceAll("[\\\\/:*?\"<>|]", "-");
}

However, upon testing I get some weird results. For example:

name = filename£/?e>"e

should return filename£--e--e from my understanding. But instead it returns filename-ú--e--e

Why is this so?

Please note that I am testing this by opening the downloaded zip file in WinZip and looking at the folder name that is created. I can't get the pound sign to appear. I've also tried this:

public static String sanitizeFilename(String name) {
    name = name.replaceAll("[£]", "\u00A3");
    return name.replaceAll("[\\\\/:*?\"<>|]", "-");
}

EDIT: After some more research I found this: http://illegalargumentexception.blogspot.co.uk/2009/04/i18n-unicode-at-windows-command-prompt.html It appears to have to do with locale, Windows version, and encoding. I'm not sure how to overcome this within the code.


Solution

  • I think it depends on which encoding is used when the file name is read.

    If it does not match the encoding the name was written in, the £ symbol can get corrupted.

    As an example (not matching your case exactly), reading a UTF-8-encoded £ as ISO Latin 1 would return Â£.

    Find out the file's encoding (ISO Latin 1 and UTF-8 are the most common), then pass the matching charset name to your Reader.

    As a snippet, you may want to consider this example:

    BufferedReader br = new BufferedReader(
        new InputStreamReader(
            new FileInputStream(new File("yourTextFile")), 
            "[your file's encoding]"
        )
    );
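
To check the two claims above separately, here is a minimal sketch. It assumes nothing about how the zip in the question is built; `EncodingDemo`, `misdecode`, and `escape` are hypothetical names introduced only for illustration. It suggests that the regex itself leaves £ alone, and that a UTF-8/ISO Latin 1 mismatch alone is enough to turn £ into two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Same character class as the sanitizeFilename method in the question.
    public static String sanitizeFilename(String name) {
        return name.replaceAll("[\\\\/:*?\"<>|]", "-");
    }

    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with another, as a mismatched reader would.
    public static String misdecode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Hypothetical helper: renders non-ASCII chars as <U+XXXX> so the
    // console's own encoding cannot distort what is printed.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("<U+%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pound sign is written as a Unicode escape so the result does
        // not depend on the encoding of this source file.
        String name = "filename\u00A3/?e>\"e";

        // The pound sign survives sanitization untouched.
        System.out.println(escape(sanitizeFilename(name)));
        // prints: filename<U+00A3>--e--e

        // UTF-8 bytes for the pound sign, decoded as ISO Latin 1.
        String garbled = misdecode("\u00A3",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(escape(garbled));
        // prints: <U+00C2><U+00A3>
    }
}
```

If the first line of output is what you expect, the sanitizer is not the culprit, and the place to look is wherever the name is converted to or from bytes.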