java regex string unicode character-properties

Checking for specific strings with regex


I have a list of Strings of arbitrary length. I need to ensure that each String in the list is alphanumeric or numeric, with no spaces; special characters such as - \ / _ etc. are allowed.

Examples of accepted strings include:

J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789

Examples of unacceptable strings include:

Hello
Joe
King

etc. Basically, no words.

I’m currently using stringInstance.matches("regex"), but I’m not sure how to write the appropriate expression:

if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true; 
else return false;

This method will always return true for words that don't conform to the format I mentioned.

A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)

Edit: I have come up with the following expression, which works, but I feel that it may be bad because it is unclear or too complex.

The expression:

(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+

I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex


Solution

  • WARNING: “Never” Write A-Z

    All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:

    1. \pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
    2. \pN for Numbers.
    3. \pM for Marks that combine with other code points.
    4. \pS for Symbols, Signs, and Sigils. :)
    5. \pP for Punctuation.
    6. \pZ for Separators like spaces (but not control characters)
    7. \pC for other invisible formatting and Control characters, including unassigned code points.
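
    As a rough illustration of the seven categories above, the following sketch classifies a few sample code points by trying each one-letter \p escape in turn. The class name CategoryDemo and the helper categoryOf are made up for this example; the \pL-style escapes themselves are standard java.util.regex syntax.

```java
import java.util.regex.Pattern;

final class CategoryDemo {
    // One-letter names of the top-level general categories, in the order listed above.
    private static final String[] NAMES = {"L", "N", "M", "S", "P", "Z", "C"};

    /** Returns the one-letter general category of a single code point (hypothetical helper). */
    static String categoryOf(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        for (String name : NAMES) {
            // Builds the regex \pL, \pN, ... and tests it against the one-code-point string.
            if (Pattern.matches("\\p" + name, s)) {
                return name;
            }
        }
        throw new IllegalStateException("no category matched");
    }

    public static void main(String[] args) {
        // Latin letter, accented letter, ASCII digit, Arabic-Indic digit,
        // combining acute accent, dollar sign, hyphen, space, BEL control character.
        int[] samples = {'A', 0x00E9, '7', 0x0663, 0x0301, '$', '-', ' ', 0x0007};
        for (int cp : samples) {
            System.out.printf("U+%04X -> \\p%s%n", cp, categoryOf(cp));
        }
    }
}
```

    Note that the accented letter and the Arabic-Indic digit land in \pL and \pN respectively, which is exactly what [A-Za-z] and [0-9] miss.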

    Solution

    If you just want a pattern, you want

     ^[\pL\pN]+$
    

    although in Java 7 you can do this:

     (?U)^\w+$
    

    assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:

     (?U)^[\p{Alpha}\pN]+$
    

    The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like \p{Alpha} (Java’s spelling of [:alpha:]) and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
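
    To make the effect of the flag concrete, here is a minimal sketch that checks every element of a list against the patterns above. ListValidator and allMatch are illustrative names, not part of any library; Pattern.UNICODE_CHARACTER_CLASSES is the real flag that an embedded (?U) turns on. The accented word is written with escapes so the example does not depend on the source file's encoding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ListValidator {
    // Letters and numbers only, via Unicode general categories.
    static final Pattern LETTERS_AND_NUMBERS = Pattern.compile("^[\\pL\\pN]+$");
    // Word characters with the Unicode flag; same effect as an embedded (?U).
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASSES);
    // Word characters without the flag: ASCII letters, digits, and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    /** Returns true only if every string in the list matches the pattern in full. */
    static boolean allMatch(Pattern pattern, List<String> values) {
        for (String value : values) {
            if (!pattern.matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> values =
                Arrays.asList("Hdka349040r38yd", "123456789", "\u00e9l\u00e8ve");
        System.out.println(allMatch(LETTERS_AND_NUMBERS, values));
        System.out.println(allMatch(UNICODE_WORD, values));
        System.out.println(allMatch(ASCII_WORD, values));
    }
}
```

    The first two checks accept the whole list; the third rejects it, because the unflagged \w does not recognise the accented letters.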

    There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
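
    One way to build such a frontend is a tiny helper that always adds the flag. UnicodePatterns and its methods are hypothetical names chosen for this sketch; only Pattern.compile and Pattern.UNICODE_CHARACTER_CLASSES come from the JDK.

```java
import java.util.regex.Pattern;

final class UnicodePatterns {
    private UnicodePatterns() {}

    /** Compiles the regex with UNICODE_CHARACTER_CLASSES always switched on. */
    static Pattern compile(String regex) {
        return compile(regex, 0);
    }

    /** Same as above, but keeps any extra flags the caller passes in. */
    static Pattern compile(String regex, int flags) {
        return Pattern.compile(regex, flags | Pattern.UNICODE_CHARACTER_CLASSES);
    }

    /** Stand-in for String.matches that uses the Unicode-aware classes. */
    static boolean matches(String regex, CharSequence input) {
        return compile(regex).matcher(input).matches();
    }
}
```

    With this in place, UnicodePatterns.matches("\\w+", "\u00e9l\u00e8ve") returns true, whereas "\u00e9l\u00e8ve".matches("\\w+") returns false.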

    Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
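
    The "élève" case can be checked directly. The sketch below spells the accented letters as escapes, and WordBoundaryDemo and firstWord are names invented for this example. It deliberately shows only the ASCII-only \w and the flagged behaviour, since the exact behaviour of the unflagged \b may depend on the JDK release you run it on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // The French word from the text, with its accented letters spelled as escapes.
    static final String WORD = "\u00e9l\u00e8ve";

    /** Returns the first match of the word pattern in text, or null if there is none. */
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("\\b\\w+\\b", flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Without the flag, \w is ASCII-only, so the whole word is not a run of \w.
        System.out.println(WORD.matches("\\w+"));
        // With the flag, \w and \b both follow the Unicode definitions,
        // so the entire word is found as a single match.
        System.out.println(firstWord(WORD, Pattern.UNICODE_CHARACTER_CLASSES));
    }
}
```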

    Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.

    In Java 7, they can — but only with the right flag.