regex pcre

Is \p{L} equivalent to [^\d\s]?


To match letters, are these two regular expressions equivalent? Is one generally preferable? Or is this a case of "it depends"?

1. Unicode letter short code:

\p{L}

2. Negated PCRE short codes for digits and whitespace:

[^\d\s]

Solution

  • They are not equivalent.

    Assuming you use the u option, \p{L} means "any letter" (Category L), while [^\s\d] means "anything that is neither whitespace (roughly Category Z) nor a digit (Category Nd)". If every character belonged to one of those three categories, the two would be equivalent by simple set complement, but many characters belong to none of them.

    The comma (,), for example, is punctuation (Category P), so it is matched by [^\s\d] but not by \p{L}.

    In fact, Unicode has many more than three categories.

    So to actually express \p{L} as a negation, you would have to write:

    [^\p{C}\p{M}\p{N}\p{P}\p{S}\p{Z}]
    

    basically listing all the other categories. It would break as soon as Unicode adds a new category and PCRE supports it, so please don't use it in production :)
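
The points above can be checked with a rough sketch in Python. This is an illustration under assumptions, not PCRE itself: Python's built-in `re` module does not support \p{L}, so the hypothetical helpers `major_category`, `is_letter` and `not_other_categories` read the Unicode general category through `unicodedata` instead. Python's \d and \s are also not guaranteed to match PCRE's definitions exactly. The sample characters are arbitrary.

```python
import re
import sys
import unicodedata

# Python's built-in `re` does not support \p{L}, so these hypothetical helpers
# read the Unicode general category directly instead.
def major_category(ch):
    return unicodedata.category(ch)[0]  # e.g. "Lu" -> "L", "Po" -> "P"

def is_letter(ch):
    return major_category(ch) == "L"  # stand-in for \p{L}

# For str patterns, Python's \d and \s are Unicode-aware by default.
NOT_DIGIT_OR_SPACE = re.compile(r"[^\d\s]")

def compare(ch):
    # Returns (letter test, [^\d\s] test) for a single character.
    return is_letter(ch), NOT_DIGIT_OR_SPACE.fullmatch(ch) is not None

# Letters pass both tests; punctuation and symbols pass only [^\d\s].
for ch in ["a", "é", ",", "_", "€", "5", " "]:
    print(ascii(ch), unicodedata.category(ch), compare(ch))

# Stand-in for [^\p{C}\p{M}\p{N}\p{P}\p{S}\p{Z}]: exclude every other major category.
def not_other_categories(ch):
    return major_category(ch) not in "CMNPSZ"

# Collect the major categories present and check whether the negation agrees
# with the letter test on every code point.
all_chars = [chr(cp) for cp in range(sys.maxunicode + 1)]
majors = {major_category(ch) for ch in all_chars}
agree = all(is_letter(ch) == not_other_categories(ch) for ch in all_chars)
print(sorted(majors), agree)
```

The comma, underscore and euro sign are where the two tests disagree. The long negated class agrees with the letter test only because the list of other major categories is currently complete, which is exactly the fragility described above.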