Tags: php, regex, unicode, pcre

Regular expression to match a Unicode block or code point range


I'm trying to create a regular expression that will match any character in a Unicode block, specifically the Mathematical Alphanumeric Symbols block.

The goal is to detect content that uses these Unicode characters to fake formatting, such as bold or italic text, in places where formatting is not otherwise supported. Plenty of websites, like this one, help users convert text this way.

I've tried using the shorthand Unicode property escape, but it doesn't match all the characters I'd expect from the block.

preg_match('/\p{Sm}/i', '𝟮') === 1; // false

It doesn't appear as though PHP supports the named variants either, so I can't do something like \p{Math}.

I believe I need to target the block's code point range, U+1D400 to U+1D7FF, but I can't work out how to build this regex correctly. This is what I thought would work, but it doesn't:

preg_match('/\x{1D400}-\x{1D7FF}/i', '𝗮') === 1; // false

I would expect none of these characters to match (typed straight on my keyboard):

abcdefghijklmnopqrstuvwxyz0123456789

I would expect every single one of these characters to match (same as above, converted to Math bold using the link above):

𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗

Solution

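  • This expression should work. The range has to sit inside a character class ([...]), and the pattern needs the u modifier so that PCRE treats both the pattern and the subject as UTF-8: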

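    // Match runs of characters in U+1D400..U+1D7FF.
    $re = '/[\x{1D400}-\x{1D7FF}]+/u';
    $str = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';
    // With PREG_SET_ORDER, each entry of $matches is one run.
    preg_match_all($re, $str, $matches, PREG_SET_ORDER);
    var_dump($matches);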
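
    As a sketch, the snippet below checks the two expectations from the question against the block range. The helper name containsMathAlphanumeric is an assumption made for illustration and is not part of the original answer.

    ```php
    <?php
    // Hypothetical helper: true when $text contains at least one code point
    // in the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF).
    function containsMathAlphanumeric(string $text): bool
    {
        // The u modifier makes \x{1D400} a single code point, not raw bytes.
        return preg_match('/[\x{1D400}-\x{1D7FF}]/u', $text) === 1;
    }

    $plain = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $bold  = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';

    var_dump(containsMathAlphanumeric($plain)); // bool(false)
    var_dump(containsMathAlphanumeric($bold));  // bool(true)

    // Anchoring the class checks that every character is inside the block.
    var_dump(preg_match('/^[\x{1D400}-\x{1D7FF}]+$/u', $bold) === 1); // bool(true)
    ```

    The second attempt in the question, /\x{1D400}-\x{1D7FF}/i, fails on both counts: outside a character class the hyphen is a literal character rather than a range operator, and without u PCRE runs in byte mode, where, as far as I know, code points above 0xFF are rejected.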
    

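    For reference, these are the Unicode symbol categories. The long names are listed for completeness; as the question notes, PHP does not accept them, so use the short forms. Most characters in the Mathematical Alphanumeric Symbols block are classified as letters (\p{L}) or decimal digits (\p{Nd}) rather than symbols, which is why \p{Sm} does not match them.

    \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    \p{Sc} or \p{Currency_Symbol}: any currency sign.
    \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.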
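
    To see which general category a given character falls into, a small probe like the following can help. It is only a sketch: firstMatchingCategory is a hypothetical helper, and it tests just four categories chosen for this example.

    ```php
    <?php
    // Hypothetical helper: returns the first of a few general categories that
    // the single character $ch belongs to, or null when none of them match.
    function firstMatchingCategory(string $ch): ?string
    {
        foreach (['Lu', 'Ll', 'Nd', 'Sm'] as $category) {
            // Anchored so that $ch must be exactly one character of that category.
            if (preg_match('/^\p{' . $category . '}$/u', $ch) === 1) {
                return $category;
            }
        }
        return null;
    }

    $samples = [
        "\u{1D400}", // MATHEMATICAL BOLD CAPITAL A
        "\u{1D41A}", // MATHEMATICAL BOLD SMALL A
        "\u{1D7EE}", // MATHEMATICAL SANS-SERIF BOLD DIGIT TWO
        "\u{1D6C1}", // MATHEMATICAL BOLD NABLA
    ];

    foreach ($samples as $ch) {
        echo $ch, ' => ', firstMatchingCategory($ch) ?? 'none', "\n";
    }
    ```

    Only a few characters in the block, such as the nabla above, are in Sm; the letters and digits are in Lu, Ll, and Nd. That is why matching on the code point range is more reliable here than matching on a category.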
    

    The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.


    Reference

    RegEx for Mathematical Alphanumeric Symbols

    Unicode Regular Expressions