java string performance uppercase lowercase

Why is String.toUpperCase() so slow?

In my tests, this code runs about 3 times faster than the standard String.toUpperCase() method:

public static String toUpperString(String pString) {
    if (pString != null) {
        char[] retChar = pString.toCharArray();
        for (int idx = 0; idx < pString.length(); idx++) {
            char c = retChar[idx];
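            // Only ASCII lowercase letters are converted; every other char is left as-is.
            if (c >= 'a' && c <= 'z') {
                // -33 is ~0x20: clearing bit 5 maps 'a'..'z' onto 'A'..'Z'.
                retChar[idx] = (char) (c & -33);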
            }
        }
        return new String(retChar);
    } else {
        return null;
    }
}

Why is it so much faster? What additional work is String.toUpperCase() doing? In other words, are there cases in which this code will not work?

Benchmark results for a random long string (plain text) executed 2,000,000 times:

toUpperString(String) : 3514.339 ms - about 3.5 seconds
String.toUpperCase() : 9705.397 ms - almost 10 seconds

UPDATE

I have added the "latin" check (characters at or above 192 now go through Character.toUpperCase) and used this as the benchmark (for those who don't believe me):

import java.math.BigInteger;
import java.security.SecureRandom;

public class BenchmarkUpperCase {

    public static String[] randomStrings;

    public static String nextRandomString() {
        SecureRandom random = new SecureRandom();
        return new BigInteger(500, random).toString(32);
    }

    public static String customToUpperString(String pString) {
        if (pString != null) {
            char[] retChar = pString.toCharArray();
            for (int idx = 0; idx < pString.length(); idx++) {
                char c = retChar[idx];
                if (c >= 'a' && c <= 'z') {
                    retChar[idx] = (char) (c & -33);
                } else if (c >= 192) { // U+00C0 and above: defer to Character.toUpperCase
                    retChar[idx] = Character.toUpperCase(c);
                }
            }
            return new String(retChar);
        } else {
            return null;
        }
    }

    public static void main(String... args) {
        long timerStart, timePeriod = 0;
        randomStrings = new String[1000];
        for (int idx = 0; idx < 1000; idx++) {
            randomStrings[idx] = nextRandomString();
        }
        String dummy = null;

        for (int count = 1; count <= 5; count++) {
            timerStart = System.nanoTime();
            for (int idx = 0; idx < 20000000; idx++) {
                dummy = randomStrings[idx % 1000].toUpperCase();
            }
            timePeriod = System.nanoTime() - timerStart;
            System.out.println(count + " String.toUpper() : " + (timePeriod / 1000000));
        }

        for (int count = 1; count <= 5; count++) {
            timerStart = System.nanoTime();
            for (int idx = 0; idx < 20000000; idx++) {
                dummy = customToUpperString(randomStrings[idx % 1000]);
            }
            timePeriod = System.nanoTime() - timerStart;
            System.out.println(count + " customToUpperString() : " + (timePeriod / 1000000));
        }
    }

}

I get these results (times in milliseconds):

1 String.toUpper() : 10724
2 String.toUpper() : 10551
3 String.toUpper() : 10551
4 String.toUpper() : 10660
5 String.toUpper() : 10575
1 customToUpperString() : 6687
2 customToUpperString() : 6684
3 customToUpperString() : 6686
4 customToUpperString() : 6693
5 customToUpperString() : 6710

The custom version is still about 60% faster (roughly 6.7 seconds versus 10.6 seconds per run).

Solution

Examining the source code for java.lang.String is instructive:

The standard version goes to considerable lengths to avoid creating a new string when it doesn't have to. This entails making two passes over the string.
The standard version uses a locale object to do the case conversion for all characters. You are only calling Character.toUpperCase for characters at or above 192, and that call is not locale-aware. While that probably works for common locales, it is possible that some locales (current or future ... or custom) will have "interesting" capitalization rules that apply to characters less than 192 as well.
The standard version is doing a proper job of converting to uppercase by Unicode code-point rather than code-unit. (Converting by code-unit is liable to break or give the wrong answer if the string contains surrogate characters.)
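The points above can be checked with a few inputs. The following is a minimal sketch, not part of the original benchmark: the class name UpperCaseEdgeCases and the compare helper are made up for illustration, and customToUpperString is copied from the question. It compares the two approaches on a Turkish 'i', a sharp s, a micro sign (which sits below the 192 cut-off), and a character outside the Basic Multilingual Plane.

```java
import java.util.Locale;

public class UpperCaseEdgeCases {

    // The question's customToUpperString, reproduced so this sketch is self-contained.
    public static String customToUpperString(String pString) {
        if (pString == null) {
            return null;
        }
        char[] retChar = pString.toCharArray();
        for (int idx = 0; idx < retChar.length; idx++) {
            char c = retChar[idx];
            if (c >= 'a' && c <= 'z') {
                retChar[idx] = (char) (c & -33);
            } else if (c >= 192) {
                retChar[idx] = Character.toUpperCase(c);
            }
        }
        return new String(retChar);
    }

    // Hypothetical helper: prints both results and whether they agree.
    static void compare(String label, String standard, String custom) {
        System.out.println(label + ": standard=" + standard + " custom=" + custom
                + " same=" + standard.equals(custom));
    }

    public static void main(String[] args) {
        // Locale rule below 192: Turkish uppercases 'i' to dotted capital I (U+0130).
        Locale turkish = Locale.forLanguageTag("tr");
        compare("turkish i", "i".toUpperCase(turkish), customToUpperString("i"));

        // One-to-many mapping: sharp s (U+00DF) becomes "SS", which a
        // fixed-length char[] rewritten in place cannot represent.
        String strasse = "stra\u00DFe";
        compare("sharp s", strasse.toUpperCase(Locale.ROOT), customToUpperString(strasse));

        // Below the 192 cut-off: micro sign (U+00B5) uppercases to Greek capital mu (U+039C).
        String micro = "\u00B5";
        compare("micro sign", micro.toUpperCase(Locale.ROOT), customToUpperString(micro));

        // Surrogate pair: Deseret small letter long I (U+10428) uppercases to U+10400,
        // but Character.toUpperCase(char) leaves each surrogate half unchanged.
        String deseret = new String(Character.toChars(0x10428));
        compare("deseret", deseret.toUpperCase(Locale.ROOT), customToUpperString(deseret));
    }
}
```

Each comparison should report same=false. The printed characters may look garbled on a console that is not set to a Unicode encoding, but the boolean is unaffected.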

The penalty for "doing it correctly" is that the standard version of toUpperCase is slower than your version¹. But it will give the correct answer in cases where your version won't.

Note that since you are testing on strings that are pure ASCII, you won't encounter the cases where your version of toUpperCase gives the wrong answer.
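To see why the benchmark input is ASCII-only: BigInteger.toString(32) emits only the digits 0-9 and the letters a-v. Here is a small sketch that checks this using the same generator expression as the benchmark. The class name BenchmarkInputCheck and the isPlainAscii helper are hypothetical, added only for this illustration.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class BenchmarkInputCheck {

    // Same expression as the benchmark's nextRandomString(), with the RNG passed in.
    public static String nextRandomString(SecureRandom random) {
        return new BigInteger(500, random).toString(32);
    }

    // Hypothetical helper: true if every char is in the 7-bit ASCII range.
    public static boolean isPlainAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        boolean allAscii = true;
        for (int i = 0; i < 1000; i++) {
            allAscii &= isPlainAscii(nextRandomString(random));
        }
        // Radix-32 output uses only 0-9 and a-v, so this prints "all ASCII: true".
        System.out.println("all ASCII: " + allAscii);
    }
}
```

To exercise the non-ASCII branches as well, the test strings would need to include accented letters, locale-sensitive letters, or supplementary characters.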

¹ According to your benchmark ... but see the other answers!