Search code examples
regexpostgresqlregex-lookaroundsregex-greedy

Non-greedy capture groups in Postgres


I'm trying to implement a regex expression to parse out different groups of a string. The string I am provided signals a customer who wants to change their primary language on our platform from X to Y, so it will look like

  • language_change__from_english_to_spanish
  • language_change__from_spanish_to_somali
  • language_change__from_simplified_chinese_to_english

I have a regex to parse out the languages (the original language, and then the new language):

SUBSTRING(language_field FROM '^language_change__([a-zA-Z_]+)_to_[a-zA-Z_]+$')

The works great for the original language, but it does not work as well for the new language, because sometimes people have entered in language_change__from_english_to_spanish_. So what is actually parsed is spanish_ instead of spanish.

I know that I could easily just wrap my SUBSTRING(...) with a REPLACE to replace the final _, but I'm trying to find a way to do it using regex to sharpen my regex skills. I basically want to ignore the final _. However, I cannot remove _ from my capture group, because some languages coded in our system have _: simplified_chinese, for instance.

Basically, I want my capture group to include the first _ followed by text (like simplified_chinese), but do not include the _ if it is at the end of the string (english_).

So when language_change__from_spanish_to_english_ is entered, I should still capture english, and not english_ as the language.

I tried various ways:

  • Using negative lookahead: SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+)(?!_)$')
  • Using non-greedy capture groups: SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)$')
  • Using a combination of negative lookahead and non-greedy capture groups: SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)(?!_)$')

I feel like I've been on StackOverflow long enough to not say "nothing works", but all three things I tried above still yield english_ and not english when people incorrectly have entered language_change__from_spanish_to_english_ on our end.

Can someone give me some pointers what I'm missing?


Solution

  • You may use

    SUBSTRING(language_field FROM '^language_change__[a-zA-Z_]+_to_([a-zA-Z_]+?)_*$')
    

    The ([a-zA-Z_]+?)_*$ part means:

    • ([a-zA-Z_]+?) - Group 1: match 1 or more ASCII letters or _ as few times as possible
    • _* - match 0 or more _ chars
    • $ - at the end of string.

    The ([a-zA-Z_]+?) pattern will match 1 char, then _*$ pattern part will be tried. If _*$ fails to match, the engine will backtrack and the ([a-zA-Z_]+?) pattern will grab another char and the tests will repeat. Hence, if _s exist at the end of the string, they will not be included in the capturing group, they will be matched with _* part, and will thus be discarded from the result.

    See this regex debugger step by step to see the regex in action.