
How to properly write regex for unicode first name in Java?


I need to write a regular expression that replaces the invalid characters in a user's input before the input is passed on. I think I need to use string.replaceAll("regex", "replacement") for that. The line of code should replace every character that is not a Unicode letter, so it is effectively a whitelist of Unicode letters. In short, it validates a user's first name and replaces the invalid characters.

What I've found so far is \p{L}\p{M}, but I'm not sure how to use it in a regex so that it works as described above. Would this be a case for regex negation?


Solution

  • Yes, you need negation. The regular expression would be [^\p{L}] for anything except letters. Another way to write this would be \P{L}.

    \p{M} means "all marks", so [^\p{L}\p{M}] means anything that is neither a letter nor a mark. This could also be written as [\P{L}&&[\P{M}]], but that is not really better.

    In a Java string literal every \ has to be doubled, so you would write string.replaceAll("[^\\p{L}\\p{M}]", "replacement") there.
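
    As a minimal sketch of the call above: the class name NameFilter and the choice of an empty replacement string are assumptions made for illustration, not something the question specifies.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the name and the empty replacement are assumptions.
final class NameFilter {
    // Compiled once; matches any character that is neither a letter (L) nor a mark (M).
    private static final Pattern NOT_LETTER_OR_MARK = Pattern.compile("[^\\p{L}\\p{M}]");

    // Removes every character that is not a Unicode letter or mark.
    static String clean(String input) {
        return NOT_LETTER_OR_MARK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("J0hn!"));        // digit and punctuation are removed
        System.out.println(clean("Zo\u00EB 42"));  // precomposed letter is kept
        System.out.println(clean("Anne-Marie"));   // note: the hyphen is removed too
    }
}
```

    Note that this pattern also strips spaces, hyphens and apostrophes, so a name like Anne-Marie loses its hyphen. If those should survive, they would have to be added to the character class explicitly.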


    From a comment:

    By the way, regarding your answer, what falls into the marks category? Do I even need that? Wouldn't just letters be fine for a first name?

    This category consists of the subcategories

    • Mn: Mark, Non-Spacing

      An example for this is ̀, U+0300. This is the COMBINING GRAVE ACCENT, and can be used together with a letter (the letter before) to create accented characters. For the commonly used accented characters there is already a precomposed form (e.g. é), but for other ones there is not.

    • Mc: Mark, Spacing Combining.

      These are quite rare; I found them mainly in South Asian scripts and for musical notes. For example, U+1D165, MUSICAL SYMBOL COMBINING STEM (𝅥), can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE (𝅝), to give something like 𝅝𝅥. (If these do not render correctly in your browser, have a look at the code charts.)

    • Me: Mark, Enclosing

      These are marks that enclose the base letter (the preceding one, if I understand correctly). One example is U+20DD, ⃝, COMBINING ENCLOSING CIRCLE, which allows creating things like A⃝ (this should render as an A enclosed by a circle). Another one is U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should give the look of a key cap with the letter on it (A⃣). (If they do not show in your browser, have a look at the code chart.)

    You can find them all by searching in UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. Some more information is in the FAQ: Characters and Combining Marks.
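
    If you would rather ask the JVM than search the data file, the following sketch maps a code point to one of the three mark subcategories using Character.getType. The class and method names are made up for illustration, and the result depends on the Unicode version your Java runtime ships with.

```java
// Hypothetical helper for illustration; only Character.getType and its constants are real API.
final class MarkCategories {
    // Returns "Mn", "Mc" or "Me" for combining marks, and "other" for everything else.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.ENCLOSING_MARK:         return "Me";
            default:                               return "other";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0300, 0x0903, 0x1D165, 0x20DD, 0x20E3, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, markCategory(cp));
        }
    }
}
```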

    Do you need them? I'm not sure. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might enter them in decomposed form; I think this is actually the default on Mac OS X. In that case you would have to run the normalization algorithm before filtering out unknown characters. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
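
    A minimal sketch of that order of operations, assuming NFC (composed form) is the normalization you want; the class name is hypothetical. java.text.Normalizer is part of the standard library since Java 6.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; the choice of NFC is an assumption.
final class NormalizedNameFilter {
    // Composes letter + combining mark sequences where a precomposed form exists,
    // then removes everything that is neither a letter nor a mark.
    static String clean(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.replaceAll("[^\\p{L}\\p{M}]", "");
    }

    public static void main(String[] args) {
        String decomposed = "Rene\u0301 1";  // 'e' followed by COMBINING ACUTE ACCENT
        String precomposed = "Ren\u00E9";    // single precomposed letter
        System.out.println(clean(decomposed).equals(clean(precomposed)));
    }
}
```

    Marks are still allowed by the filter here, because some letter and mark combinations have no precomposed form and would otherwise lose their accent.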


    Edit: not directly relating to the question, but relating to the discussion in the comments:

    I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:

    package de.fencing_game.paul.examples;
    
    import java.util.regex.*;
    
    public class RegexSample {
    
        static String[] regexps = {
            "[^\\pL\\pM]", "[\\PL\\PM]",
            ".", "\\pL", "\\pM",
            "\\PL", "\\PM"
        };
    
        static String[] strings = {
            "x", "A", "3", "\n", ".", "\t", "\r", "\f",
            " ", "-", "!", "»", "›", "‹", "«",
            "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
            "༬", "༺", "༼", "ང", "⃓", "✄",
            "⟪", "や", "゙",
            "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
            "⑪", "⒄", "⒰", "ⓛ", "⓶",
            "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
            "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
            "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
            "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
        };
    
    
        public static void main(String[] params) {
            Pattern[] patterns = new Pattern[regexps.length];
    
            System.out.print("       ");
            for(int i = 0; i < regexps.length; i++) {
                patterns[i] = Pattern.compile(regexps[i]);
                System.out.print("| " + patterns[i] + " ");
            }
            System.out.println();
            System.out.print("-------");
            for(int i = 0; i < regexps.length; i++) {
                System.out.print("|-" +
                                 "--------------".substring(0,
                                                            regexps[i].length()) +
                                 "-");
            }
            System.out.println();
    
            for(int j = 0; j < strings.length; j++) {
                System.out.printf("U+%04x ", (int)strings[j].charAt(0));
                for(int i = 0; i < regexps.length; i++) {
                    boolean match = patterns[i].matcher(strings[j]).matches();
                    System.out.print("| " + (match ? "✔" : "-")  +
                                     "         ".substring(0, regexps[i].length()));
                }
                System.out.println();
            }
        }
    }
    

    Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):

           | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM 
    -------|-----------|----------|---|-----|-----|-----|-----
    U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔   
    U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔   
    U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
    U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
    U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
    U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
    U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
    U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
    U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
    

    We can see that:

    1. [^\pL\pM] is not equivalent to [\PL\PM]
    2. [\PL\PM] really matches everything, but
    3. still [\PL\PM] is not equal to ., since . does not match \n and \r.

    The second point is caused by the fact that [\PL\PM] is the union of \PL and \PM: \PL contains characters from all categories other than L (including M), and \PM contains characters from all categories other than M (including L) - together they contain the whole character repertoire.

    [^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
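
    The equivalence can also be checked by brute force instead of on a hand-picked sample. The sketch below (class and method names are made up) compares two character classes on every code point except the surrogate range and counts where they disagree; it uses Java's && intersection syntax for the second form.

```java
import java.util.regex.Pattern;

// Hypothetical checker for illustration, not part of the original test program.
final class DeMorganCheck {
    // Counts the code points on which the two single-character regexes disagree.
    static int countDifferences(String regexA, String regexB) {
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        int differences = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Skip lone surrogates; they are not valid characters on their own.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Complement of the union versus intersection of the complements.
        System.out.println(countDifferences("[^\\pL\\pM]", "[\\PL&&[\\PM]]"));
        // Complement of the union versus union of the complements.
        System.out.println(countDifferences("[^\\pL\\pM]", "[\\PL\\PM]"));
    }
}
```

    The first count should be zero. The second should be the number of letters and marks known to your Java version, since those are exactly the characters that [\PL\PM] matches and [^\pL\pM] does not.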