Tags: java, string, code-generation, identifier

How to convert arbitrary string to Java identifier?


I need to convert any arbitrary string:

  • string with spaces
  • 100stringsstartswithnumber
  • string€with%special†characters/\!
  • [empty string]

to a valid Java identifier:

  • string_with_spaces
  • _100stringsstartswithnumber
  • string_with_special_characters___
  • _

Is there an existing tool for this task?

With so many Java source refactoring/generating frameworks, one would think this would be quite a common task.


Solution

  • This simple method converts any input string into a valid Java identifier:

    // requires: import java.nio.charset.StandardCharsets;
    public static String getIdentifier(String str) {
        StringBuilder sb = new StringBuilder("_");
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            // Java bytes are signed; mask to 0-255 so no minus sign is lost
            sb.append(b & 0xFF).append('_');
        }
        return sb.toString();
    }
    

    Example:

    "%^&*\n()" --> "_37_94_38_42_10_40_41_"
    

    Any input works: non-Latin characters, line feeds, anything.
    In addition, this algorithm is:

    • reproducible
    • unique: two inputs produce the same result if and only if str1.equals(str2) (for strings without unpaired surrogates, which UTF-8 cannot encode)
    • reversible

    Thanks to Joachim Sauer for the UTF-8 suggestion
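
    The reversibility claim can be illustrated with a round trip. This is a minimal sketch, assuming the identifier holds one unsigned byte value (0-255) per UTF-8 byte, separated by underscores. IdentifierCodec, encode and decode are hypothetical names, not part of the answer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class IdentifierCodec {

    // Encode: one unsigned byte value (0-255) per UTF-8 byte, wrapped in underscores.
    static String encode(String str) {
        StringBuilder sb = new StringBuilder("_");
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            sb.append(b & 0xFF).append('_');
        }
        return sb.toString();
    }

    // Decode: read the numbers back into bytes and interpret them as UTF-8.
    static String decode(String identifier) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (String part : identifier.split("_")) {
            if (!part.isEmpty()) {
                bytes.write(Integer.parseInt(part));
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = encode("a b");
        System.out.println(id);          // _97_32_98_
        System.out.println(decode(id));  // a b
    }
}
```

    The underscores act as separators, so multi-digit byte values cannot run together, which is what makes decoding unambiguous.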


    If collisions are OK (that is, two input strings may produce the same result; for example, "a " and "a32" both become "a32"), this code produces more readable output:

    public static String getIdentifier(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isJavaIdentifierPart(c))
                sb.append(c);
            else
                sb.append((int) c);
        }
        // an identifier cannot be empty or begin with a digit or other non-start character
        if (sb.length() == 0 || !Character.isJavaIdentifierStart(sb.charAt(0)))
            sb.insert(0, '_');
        return sb.toString();
    }
    

    It keeps characters that are valid in an identifier and converts the rest to their decimal character codes. A leading underscore is added when the result would otherwise be empty or start with a character, such as a digit, that cannot begin an identifier.
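
    If the exact output listed in the question is wanted, each invalid character can be replaced with an underscore instead of a number. The following is a sketch under one assumption that the question does not state: only ASCII letters, digits and '_' are kept, so characters such as '€' and '$', which Java itself allows in identifiers, are replaced too. UnderscoreIdentifier and toIdentifier are hypothetical names.

```java
// Hypothetical helper; the class and method names are illustrative.
final class UnderscoreIdentifier {

    static String toIdentifier(String str) {
        // Assumption: keep only ASCII letters, digits and '_'; replace the rest with '_'.
        String s = str.replaceAll("[^A-Za-z0-9_]", "_");
        // An identifier cannot be empty or begin with a digit.
        if (s.isEmpty() || Character.isDigit(s.charAt(0))) {
            s = "_" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("string with spaces"));         // string_with_spaces
        System.out.println(toIdentifier("100stringsstartswithnumber")); // _100stringsstartswithnumber
    }
}
```

    This sketch does not handle reserved words: an input such as "class" passes through unchanged, and a lone "_" (the result for the empty string) is itself a keyword on Java 9 and later, so generated code may need an extra prefix or suffix in those cases.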