I need to write a regular expression to replace the invalid characters in a user's input before passing it on. I think I need to use string.replaceAll("regex", "replacement") to do that.
The line of code in question should replace all characters which are not Unicode letters, so it is a whitelist of Unicode characters. Basically, it validates a user's first name and replaces the invalid characters in it.
What I've found so far is this: \p{L}\p{M}, but I'm not sure how to use it in a regex so that it works as explained above. Would this be a case for regex negation?
Yes, you need negation. The regular expression would be [^\p{L}] for anything except letters. Another way to write this would be \P{L}.
\p{M} means "all marks", thus [^\p{L}\p{M}] means "anything which is neither a letter nor a mark". This could also be written as [\P{L}&&[\P{M}]], but that is not really better.
In a Java string literal every \ has to be doubled, so you would write string.replaceAll("[^\\p{L}\\p{M}]", "replacement") there.
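As a minimal sketch of how this could be wired up: the class name NameFilter and the method stripInvalid are made up for illustration, and replacing with the empty string is only an assumption about what you want.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NameFilter {
    // Anything that is neither a letter (\p{L}) nor a mark (\p{M}).
    private static final Pattern NOT_LETTER_OR_MARK =
        Pattern.compile("[^\\p{L}\\p{M}]");

    // Removes every character outside the whitelist (assumed replacement: "").
    static String stripInvalid(String input) {
        return NOT_LETTER_OR_MARK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The digit and the punctuation are dropped.
        System.out.println(stripInvalid("J0hn!"));
        // The precomposed é (U+00E9) is a letter and is kept.
        System.out.println(stripInvalid("Ren\u00e9e 42"));
        // A decomposed e + U+0301 is kept too, because marks are whitelisted.
        System.out.println(stripInvalid("Rene\u0301").length());
    }
}
```

Compiling the pattern once is only a convenience; string.replaceAll with the same regex behaves the same way. Note that this whitelist also removes spaces and hyphens, so add those to the character class if names like Anne-Marie should survive unchanged.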
From a comment:
By the way, regarding to your answer, what fall in the marks category? Do I even need that? Wouldn't just letters be fine for firstname?
This category consists of the subcategories
Mn: Mark, Non-Spacing
An example of this is ̀ (U+0300), the COMBINING GRAVE ACCENT, which can be used together with a letter (the one before it) to create accented characters. For the commonly used accented characters there is already a precomposed form (e.g. é), but for others there is not.
Mc: Mark, Spacing Combining.
These are quite rare ... I found them mainly in South Asian scripts, and for musical notes. For example, there is U+1D165, MUSICAL SYMBOL COMBINING STEM, 𝅥, which can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE, 𝅝, to give something like 𝅝𝅥. (Hmm, the images do not look right here. I suppose my browser does not support these characters. Have a look at the code charts if they look wrong for you, too.)
Me: Mark, Enclosing
These are marks which enclose the base letter (the preceding one, if I understand correctly). One example is U+20DD, ⃝, COMBINING ENCLOSING CIRCLE, which allows creating things like A⃝. (This should be rendered as an A enclosed by a circle, if I understand correctly. It is not, in my browser.) Another one is U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should give the look of a key cap with the letter on it (A⃣). (They do not show in my browser. Have a look at the code chart if you can't see them.)
You can find them all by searching in UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. Some more information is in the FAQ: Characters and Combining Marks.
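If you would rather ask Java than search the data file, Character.getType reports the general category of a code point. The following is just a sketch; the class MarkCategories and the method markCategory are names invented for this example, and U+0903 (DEVANAGARI SIGN VISARGA) is my own pick for a spacing mark.

```java
// Hypothetical helper, not part of any library.
final class MarkCategories {
    // Maps a code point to the short name of its mark subcategory, if any.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
                return "Mn";
            case Character.COMBINING_SPACING_MARK:
                return "Mc";
            case Character.ENCLOSING_MARK:
                return "Me";
            default:
                return "not a mark";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0300, 0x0903, 0x20DD, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X: %s%n", cp, markCategory(cp));
        }
    }
}
```

The answer depends on the Unicode version your Java runtime ships with, so code points added later may show up as unassigned on old JVMs.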
Do you need them? I'm not sure. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might input them in decomposed form; I think on Mac OS X this is actually the default. If you allow only letters, you would then have to run the normalization algorithm before filtering out the unknown characters, or the accents would be stripped. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
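A rough sketch of that normalize-then-filter order, using java.text.Normalizer (available since Java 6). The class name NormalizedNameFilter and the method lettersOnly are invented here, and choosing NFC (composed form) is my assumption about what fits a letters-only filter.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of any library.
final class NormalizedNameFilter {
    // Composes the input first (NFC), then removes everything that is not a letter.
    static String lettersOnly(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.replaceAll("\\P{L}", "");
    }

    public static void main(String[] args) {
        String decomposed = "Rene\u0301"; // e followed by COMBINING ACUTE ACCENT

        // Without normalization the accent is a mark and gets removed.
        System.out.println(decomposed.replaceAll("\\P{L}", ""));

        // With normalization, e + U+0301 becomes the single letter U+00E9 first.
        System.out.println(lettersOnly(decomposed));
    }
}
```

Letters which have no precomposed form still lose their marks with this approach, so keeping \p{M} in the whitelist remains the safer choice if such names matter to you.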
Edit: not directly relating to the question, but relating to the discussion in the comments:
I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:
package de.fencing_game.paul.examples;

import java.util.regex.*;

public class RegexSample {

    // The patterns to compare; each one becomes a column of the table.
    static String[] regexps = {
        "[^\\pL\\pM]", "[\\PL\\PM]",
        ".", "\\pL", "\\pM",
        "\\PL", "\\PM"
    };

    // One-character test strings; each one becomes a row of the table.
    static String[] strings = {
        "x", "A", "3", "\n", ".", "\t", "\r", "\f",
        " ", "-", "!", "»", "›", "‹", "«",
        "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
        "༬", "༺", "༼", "ང", "⃓", "✄",
        "⟪", "や", "゙",
        "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
        "⑪", "⒄", "⒰", "ⓛ", "⓶",
        "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
        "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
        "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
        "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
    };

    public static void main(String[] params) {
        Pattern[] patterns = new Pattern[regexps.length];

        // Header row: seven spaces (the width of "U+xxxx "), then the patterns.
        System.out.print("       ");
        for(int i = 0; i < regexps.length; i++) {
            patterns[i] = Pattern.compile(regexps[i]);
            System.out.print("| " + patterns[i] + " ");
        }
        System.out.println();

        // Separator row, with each column as wide as its pattern.
        System.out.print("-------");
        for(int i = 0; i < regexps.length; i++) {
            System.out.print("|-" +
                             "--------------".substring(0,
                                                        regexps[i].length()) +
                             "-");
        }
        System.out.println();

        // One row per test string: does the whole string match the pattern?
        for(int j = 0; j < strings.length; j++) {
            System.out.printf("U+%04x ", (int)strings[j].charAt(0));
            for(int i = 0; i < regexps.length; i++) {
                boolean match = patterns[i].matcher(strings[j]).matches();
                // Pad with spaces up to the width of the column header.
                System.out.print("| " + (match ? "✔" : "-") +
                                 "              ".substring(0, regexps[i].length()));
            }
            System.out.println();
        }
    }
}
Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):
       | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM
-------|-----------|----------|---|-----|-----|-----|-----
U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔
U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔
U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -
U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -
U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
We can see that:
[^\pL\pM] is not equivalent to [\PL\PM].
[\PL\PM] really matches everything.
But [\PL\PM] is not equal to ., since . does not match \n and \r.
The second point is caused by the fact that [\PL\PM] is the union of \PL and \PM: \PL contains characters from all categories other than L (including M), and \PM contains characters from all categories other than M (including L). Together they contain the whole character repertoire.
[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
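To check that last equivalence rather than take my word for it, here is a small sketch which compares the negated union with the intersection written in Java's && syntax. The class name DeMorganCheck, the method countDisagreements and the tested range are arbitrary choices for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class DeMorganCheck {
    // Complement of the union of \pL and \pM.
    static final Pattern NEGATED_UNION = Pattern.compile("[^\\pL\\pM]");
    // Intersection of \PL and \PM, written with Java's && operator.
    static final Pattern INTERSECTION = Pattern.compile("[\\P{L}&&[\\P{M}]]");

    // Counts the code points in [from, to) on which the two patterns disagree.
    static int countDisagreements(int from, int to) {
        int count = 0;
        for (int cp = from; cp < to; cp++) {
            String s = new String(Character.toChars(cp));
            boolean a = NEGATED_UNION.matcher(s).matches();
            boolean b = INTERSECTION.matcher(s).matches();
            if (a != b) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stays below the surrogate range, so every string is one valid character.
        System.out.println("disagreements: " + countDisagreements(0, 0x3000));
        // The union [\PL\PM], in contrast, matches even a plain letter.
        System.out.println(Pattern.matches("[\\PL\\PM]", "x"));
    }
}
```

If the two forms really are equivalent, the count should be zero, while the last line shows the union form accepting a letter, as in the table above.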