I'm trying to extract the volume of a string field to sort by it numerically.
Given the following data:
First attempt:
Just cast the string to unsigned
(found in this thread: reference)
Problem: It'll obviously also take the 4th data example into account.
Second attempt: Use REGEXP_REPLACE to extract the value (found in this thread: reference)
CAST(
REGEXP_REPLACE(
LOWER(column),
"[:digit:]+[:space:]*ml",
"///"
) as UNSIGNED
) AS volume
...using the character_class
es defined in the manual.
Problem: The character_class
es seem not to work as intended (possibly an issue with the character class name for spaces?).
Third attempt: Using the common Regexp tokens
CAST(
REGEXP_REPLACE(
LOWER(column),
"\d+\s*ml",
"///"
) as UNSIGNED
) AS volume
...which seems to work better.
Problem: It also matches the 4th data example again.
Question: How do I perform some kind of positive lookahead in MySQL REGEXP_REPLACE that will only match data examples 1-3 (ignoring the case sensitiveness due to casting to lower)?
The one i've tried seem to work fine in PCRE (with global flag) but not in MySQL:
^.*\d+\s*ml.*$
(ignoring the missing capturing group)
Which is translated like: Match everything until 1 or more digits, followed by 0 or more whitespaces and the string ml
is found.
Try this:
CAST(
REGEXP_SUBSTR(
LOWER(@c),
"([[:digit:]]+)[[:space:]]*ml" )
AS UNSIGNED) AS VOLUME;
You definitely want REGEX_SUBSTR() rather than REGEX_REPLACE() for your purpose.
I've tried it on MySQL 8 and MariaDB 10.3 - 10.5. Doubling the brackets [[:digit:]]
was required, for me anyhow. And if you use the older \s
notation, you need to double up the \
escape character: \\s+
(MariaDb's implementation of REGEX_SUBSTR() does not accept the optional parameters that MySQL's does.
I have to say, having worked in health care IT, that combining regular expressions with drug dosages frightens me. Be careful! Test! Test! If you have patients, they will thank you.