
How to remove surrogate characters in Java?


I am getting surrogate characters in text that I am saving to MySQL 5.1. Since UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database.

I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}

Solution

  • Here are a couple of things:

    • Character.isSurrogate(char c):

      A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

    • Checking for pairs seems pointless; why not just remove all surrogates?

    • x == false is equivalent to !x

    • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).
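
    To make the first two points concrete, here is a minimal sketch of how a character outside the Basic Multilingual Plane shows up in a Java String. The class name SurrogateUnits, the method classify, and the sample string are illustrative assumptions, not part of the original question.

```java
class SurrogateUnits {
    // Labels each UTF-16 code unit of s as H (high surrogate),
    // L (low surrogate), or . (ordinary BMP char).
    static String classify(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                sb.append('H');
            } else if (Character.isLowSurrogate(c)) {
                sb.append('L');
            } else {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as two chars.
        String sample = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(sample.length());   // 4
        System.out.println(classify(sample));  // .HL.
    }
}
```

    Every code unit of such a character is itself a surrogate, so dropping all surrogate chars removes complete pairs as well as stray halves.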

    I suggest this:

    public static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // On Java 7+, this condition can be written as !Character.isSurrogate(c)
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
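
    On Java 7 or later, the same idea can be sketched with Character.isSurrogate directly. The class name SurrogateStripper and the sample input below are hypothetical, chosen only for illustration.

```java
class SurrogateStripper {
    // Drops every UTF-16 surrogate code unit, whether paired or unpaired.
    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!Character.isSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One complete surrogate pair (U+1F600).
        String emoji = new String(Character.toChars(0x1F600));
        // The pair in the middle plus a stray high surrogate at the end.
        String input = "ab" + emoji + "cd" + '\uD83D';
        System.out.println(removeSurrogates(input)); // abcd
        System.out.println(removeSurrogates(""));    // empty line, no exception
    }
}
```

    Unlike the method in the question, this version never indexes query.length() - 1, so an empty string is handled without throwing.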
    

    Breaking down the if statement

    You asked about this statement:

    if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
        sb.append(c);
    }
    

    One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

    static boolean isSurrogate(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }
    
    static boolean isNotSurrogate(char c) {
        return !isSurrogate(c);
    }
    
    ...
    
    if (isNotSurrogate(c)) {
        sb.append(c);
    }
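
    As a sanity check on this breakdown, the following sketch loops over every possible char value and confirms that the helper agrees with a plain range test on U+D800 to U+DFFF, the block reserved for surrogates. The class name SurrogateCheck and the method countMismatches are hypothetical.

```java
class SurrogateCheck {
    static boolean isSurrogate(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }

    static boolean isNotSurrogate(char c) {
        return !isSurrogate(c);
    }

    // Counts the char values for which isNotSurrogate disagrees
    // with a direct range test on the surrogate block.
    static int countMismatches() {
        int mismatches = 0;
        // Loop with an int so the counter cannot wrap around at 0xFFFF.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            boolean outsideRange = c < 0xD800 || c > 0xDFFF;
            if (isNotSurrogate(c) != outsideRange) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println(countMismatches()); // 0
    }
}
```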