I'm trying to implement a regex expression to parse out different groups of a string. The string I am provided signals a customer who wants to change their primary language on our platform from X to Y, so it will look like
language_change__from_english_to_spanish
language_change__from_spanish_to_somali
language_change__from_simplified_chinese_to_english
I have a regex to parse out the languages (the original language, and then the new language):
SUBSTRING(language_field FROM '^language_change__([a-zA-Z_]+)_to_[a-zA-Z_]+$')
The works great for the original language, but it does not work as well for the new language, because sometimes people have entered in language_change__from_english_to_spanish_
. So what is actually parsed is spanish_
instead of spanish
.
I know that I could easily just wrap my SUBSTRING(...)
with a REPLACE
to replace the final _
, but I'm trying to find a way to do it using regex to sharpen my regex skills. I basically want to ignore the final _
. However, I cannot remove _
from my capture group, because some languages coded in our system have _
: simplified_chinese
, for instance.
Basically, I want my capture group to include the first _
followed by text (like simplified_chinese
), but do not include the _
if it is at the end of the string (english_
).
So when language_change__from_spanish_to_english_
is entered, I should still capture english
, and not english_
as the language.
I tried various ways:
SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+)(?!_)$')
SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)$')
SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)(?!_)$')
I feel like I've been on StackOverflow long enough to not say "nothing works", but all three things I tried above still yield english_
and not english
when people incorrectly have entered language_change__from_spanish_to_english_
on our end.
Can someone give me some pointers what I'm missing?
You may use
SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)_*$')
The ([a-zA-Z_]+?)_*$
part means:
([a-zA-Z_]+?)
- Group 1: match 1 or more ASCII letters or _
as few times as possible_*
- match 0 or more _
chars$
- at the end of string.The ([a-zA-Z_]+?)
pattern will match 1 char, then _*$
pattern part will be tried. If _*$
fails to match, the engine will backtrack and the ([a-zA-Z_]+?)
pattern will grab another char and the tests will repeat. Hence, if _
s exist at the end of the string, they will not be included in the capturing group, they will be matched with _*
part, and will thus be discarded from the result.
See this regex debugger step by step to see the regex in action.