Search code examples
mysqlregexregexp-replacemysql-8.0

MySQL REGEXP_REPLACE translation from PCRE REGEXP


I'm trying to extract the volume of a string field to sort by it numerically.


Given the following data:

  1. Something at 300 ml
  2. 300Ml somthing
  3. Something special (with 300 Ml)
  4. 8-v something that should not match

First attempt: Just cast the string to unsigned (found in this thread: reference)

Problem: It'll obviously also take the 4th data example into account.


Second attempt: Use REGEXP_REPLACE to extract the value (found in this thread: reference)

CAST(
    REGEXP_REPLACE(
        LOWER(column),
        "[:digit:]+[:space:]*ml",
        "///"
    ) as UNSIGNED
) AS volume

...using the character_classes defined in the manual.

Problem: The character_classes seem not to work as intended (possibly an issue with the character class name for spaces?).


Third attempt: Using the common Regexp tokens

CAST(
    REGEXP_REPLACE(
        LOWER(column),
        "\d+\s*ml",
        "///"
    ) as UNSIGNED
) AS volume

...which seems to work better.

Problem: It also matches the 4th data example again.


Question: How do I perform some kind of positive lookahead in MySQL REGEXP_REPLACE that will only match data examples 1-3 (ignoring the case sensitiveness due to casting to lower)?

The one i've tried seem to work fine in PCRE (with global flag) but not in MySQL:

^.*\d+\s*ml.*$ (ignoring the missing capturing group)

Which is translated like: Match everything until 1 or more digits, followed by 0 or more whitespaces and the string ml is found.


Solution

  • Try this:

           CAST(
              REGEXP_SUBSTR( 
                    LOWER(@c), 
                    "([[:digit:]]+)[[:space:]]*ml" )
              AS UNSIGNED)  AS VOLUME;
    

    You definitely want REGEX_SUBSTR() rather than REGEX_REPLACE() for your purpose.

    I've tried it on MySQL 8 and MariaDB 10.3 - 10.5. Doubling the brackets [[:digit:]] was required, for me anyhow. And if you use the older \s notation, you need to double up the \ escape character: \\s+

    (MariaDb's implementation of REGEX_SUBSTR() does not accept the optional parameters that MySQL's does.

    I have to say, having worked in health care IT, that combining regular expressions with drug dosages frightens me. Be careful! Test! Test! If you have patients, they will thank you.