How to properly write regex for unicode first name in Java?

I need to write a regular expression to replace invalid characters in a user's input before sending it further. I think I need to use string.replaceAll("regex", "replacement") to do that. The line of code should replace every character that is not a Unicode letter, so it is effectively a whitelist of Unicode letters. In short, it validates a user's first name and replaces the invalid characters.

What I've found so far is this: \p{L}\p{M}, but I'm not sure how to fire it up in regexp so it would work as I explained above. Would this be a regex negation case?

Solution

Yes, you need negation. The regular expression would be [^\p{L}] for anything except letters. Another way to write this would be \P{L}.

\p{M} means "all marks", so [^\p{L}\p{M}] means anything that is neither a letter nor a mark. This could also be written as [\P{L}&&[\P{M}]], but that is not really better.

In a Java string literal every \ has to be doubled, so you would write string.replaceAll("[^\\p{L}\\p{M}]", "replacement") there.
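As a minimal sketch of how this could look in practice (the class name NameSanitizer and the method sanitize are made up for illustration, and the empty replacement string is an assumption about what you want to do with invalid characters):

```java
import java.util.regex.Pattern;

final class NameSanitizer {
    // Matches any single character that is neither a letter (L) nor a mark (M).
    // Compiled once so the pattern is not re-parsed on every call.
    private static final Pattern NOT_LETTER_OR_MARK =
            Pattern.compile("[^\\p{L}\\p{M}]");

    // Hypothetical helper: drops every character that is not a letter or a mark.
    static String sanitize(String input) {
        return NOT_LETTER_OR_MARK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Digits and punctuation are removed, the letters stay.
        System.out.println(sanitize("J\u00fcrgen42!"));
        // A decomposed "e" + COMBINING ACUTE ACCENT survives, because marks are kept.
        System.out.println(sanitize("Rene\u0301 1"));
    }
}
```

Note that this also removes spaces, hyphens and apostrophes, so a name like "Anne-Marie" would become "AnneMarie". If that is not wanted, those characters would have to be added to the character class.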

From a comment:

By the way, regarding your answer, what falls into the marks category? Do I even need that? Wouldn't just letters be fine for a first name?

This category consists of the following subcategories:

Mn: Mark, Non-Spacing

An example is U+0300 (̀), COMBINING GRAVE ACCENT, which is used together with the preceding letter to create an accented character. For the commonly used accented characters there is already a precomposed form (e.g. é), but for other combinations there is not.

Mc: Mark, Spacing Combining

These are quite rare. I found them mainly in South Asian scripts and among the musical symbols. For example, U+1D165, MUSICAL SYMBOL COMBINING STEM, can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE. (These characters do not render properly in my browser, so have a look at the code charts to see what they look like.)

Me: Mark, Enclosing

These are marks which somehow enclose the base letter (the preceding one, if I understand correctly). One example is U+20DD (⃝), COMBINING ENCLOSING CIRCLE, which allows creating things like A⃝, which should be rendered as an A enclosed in a circle. Another one is U+20E3 (⃣), COMBINING ENCLOSING KEYCAP, which should give the look of a key cap with the letter on it (A⃣). (Neither renders properly in my browser. Have a look at the code chart if you can't see them.)

You can find them all by searching in UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. Some more information is in the FAQ: Characters and Combining Marks.
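If you would rather check a single character from Java than search the data file, Character.getType reports the general category of a code point. This is only a sketch; the class name MarkCategories and the helper markCategory are invented for the example:

```java
final class MarkCategories {
    // Hypothetical helper: returns the mark subcategory of a code point,
    // or "other" if the code point is not a mark at all.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
                return "Mn";
            case Character.COMBINING_SPACING_MARK:
                return "Mc";
            case Character.ENCLOSING_MARK:
                return "Me";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0300, 0x1D165, 0x20DD, 0x20E3, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, markCategory(cp));
        }
    }
}
```

The int overload of Character.getType is used here so that code points outside the Basic Multilingual Plane, such as U+1D165, work as well.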

Do you need them? I'm not sure here. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might input them in decomposed form - I think on Mac OS X this is actually the default. If you keep only letters, you would have to run the normalization algorithm (to the composed form, NFC) before filtering away unknown characters, so that a base letter plus mark becomes a single precomposed letter where one exists. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
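To illustrate the normalization step, here is a small sketch using java.text.Normalizer from the standard library. The class name NameNormalizer and the method lettersOnly are made up, and filtering with \P{L} only (no marks) is just one of the options discussed above:

```java
import java.text.Normalizer;

final class NameNormalizer {
    // Hypothetical helper: composes the input to NFC first, then removes
    // everything that is not a letter.
    static String lettersOnly(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.replaceAll("\\P{L}", "");
    }

    public static void main(String[] args) {
        String decomposed = "Rene\u0301"; // "e" followed by COMBINING ACUTE ACCENT

        // Without normalization the accent is a separate mark and gets stripped.
        System.out.println(decomposed.replaceAll("\\P{L}", ""));

        // With normalization the pair becomes the precomposed letter U+00E9 and survives.
        System.out.println(lettersOnly(decomposed));
    }
}
```

Combinations that have no precomposed form stay decomposed even after NFC, so a letters-only filter would still strip their marks. That is the case where keeping \p{M} in the whitelist matters.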

Edit: not directly relating to the question, but relating to the discussion in the comments:

I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:

package de.fencing_game.paul.examples;

import java.util.regex.*;

public class RegexSample {

    static String[] regexps = {
        "[^\\pL\\pM]", "[\\PL\\PM]",
        ".", "\\pL", "\\pM",
        "\\PL", "\\PM"
    };

    static String[] strings = {
        "x", "A", "3", "\n", ".", "\t", "\r", "\f",
        " ", "-", "!", "»", "›", "‹", "«",
        "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
        "༬", "༺", "༼", "ང", "⃓", "✄",
        "⟪", "や", "゙", 
        "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
        "⑪", "⒄", "⒰", "ⓛ", "⓶",
        "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
        "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
        "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
        "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
    };


    public static void main(String[] params) {
        Pattern[] patterns = new Pattern[regexps.length];

        System.out.print("       ");
        for(int i = 0; i < regexps.length; i++) {
            patterns[i] = Pattern.compile(regexps[i]);
            System.out.print("| " + patterns[i] + " ");
        }
        System.out.println();
        System.out.print("-------");
        for(int i = 0; i < regexps.length; i++) {
            System.out.print("|-" +
                             "--------------".substring(0,
                                                        regexps[i].length()) +
                             "-");
        }
        System.out.println();

        for(int j = 0; j < strings.length; j++) {
            System.out.printf("U+%04x ", (int)strings[j].charAt(0));
            for(int i = 0; i < regexps.length; i++) {
                boolean match = patterns[i].matcher(strings[j]).matches();
                System.out.print("| " + (match ? "✔" : "-")  +
                                 "         ".substring(0, regexps[i].length()));
            }
            System.out.println();
        }
    }
}

Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):

       | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM 
-------|-----------|----------|---|-----|-----|-----|-----
U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔

We can see that:

[^\pL\pM] is not equivalent to [\PL\PM]
[\PL\PM] really matches everything, but
still [\PL\PM] is not equal to ., since . does not match \n and \r.

The second point is caused by the fact that [\PL\PM] is the union of \PL and \PM: \PL contains characters from all categories other than L (including M), and \PM contains characters from all categories other than M (including L) - together they contain the whole character repertoire.

[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
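This equivalence can be checked mechanically. The following sketch (class and method names are invented for the example) compares the patterns on every code point of the Basic Multilingual Plane, skipping the surrogate range, and counts how often two patterns disagree:

```java
import java.util.regex.Pattern;

final class DeMorganCheck {
    static final Pattern COMPLEMENT_OF_UNION = Pattern.compile("[^\\pL\\pM]");
    static final Pattern INTERSECTION = Pattern.compile("[\\PL&&\\PM]");
    static final Pattern UNION_OF_COMPLEMENTS = Pattern.compile("[\\PL\\PM]");

    // Counts the BMP code points (surrogates skipped) on which a and b disagree.
    static int disagreements(Pattern a, Pattern b) {
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // lone surrogates are not characters on their own
            }
            String s = String.valueOf((char) cp);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The complement of the union and the intersection of the complements agree.
        System.out.println(disagreements(COMPLEMENT_OF_UNION, INTERSECTION));
        // The union of the complements differs: it also matches letters and marks.
        System.out.println(disagreements(COMPLEMENT_OF_UNION, UNION_OF_COMPLEMENTS));
    }
}
```

The first count should be 0. The second one is the number of letters and marks in the tested range, which depends on the Unicode version your Java runtime supports.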