Tags: regex, regex-lookarounds, regex-group

Regex to get the last match of a pattern


Here is a string similar to what I'm trying to match (minus a couple of specific patterns, for the sake of simplicity):

Hello, tonight I'm in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123] and I have no money.

I'm trying to match only the last occurrence: in Hotel HomeStay [123].

I'm not very familiar with regex concepts like lookahead and lookbehind right now. Similar questions here don't seem to solve my issue. I've tried a bunch of regexes (to the best of my understanding), and this is what I came up with: (?= (?:in|\d+))([\w \[]*\s*\d*\]*)(?!.*in). The digits and special characters may be part of what I'm actually trying to match.

The lookahead and lookbehind patterns are not restricted to containing only in. They can contain other common words as well, such as and and is. I'm only looking for the last occurrence of any of these, followed by the main pattern, which is quite distinctive. Edit: let's say the match must contain either HomeStay or LuxuryInn, for the sake of the example.

However, this matches the whole of in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123]. Where am I going wrong? Also, could someone explain why the in is captured despite being placed in a non-capturing group?

Any help is greatly appreciated.


Solution

  • If you want to retrieve text that contains HomeStay, is preceded by one of certain words, and does not itself contain those words, you can use a capture group with a negative lookahead inside it. The regex below captures all occurrences (working fiddle).

    \b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)
    

    Here, the regex looks for:

    • a given prefix (in, and or is as a whole word: a word boundary \b before it and whitespace after it)
    • ... followed by at least one whitespace character,
    • ... then a sequence of 0 or more characters each one not followed by a prefix,
    • ... followed by HomeStay,
    • ... followed by another sequence of 0 or more characters, each one still not followed by a prefix
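
As a quick check of the steps above, here is a minimal Python sketch that runs the same regex against the sample sentence from the question. The variable names (PATTERN, text, matches) are only illustrative, and Python's re module is an assumption here; the original answer links a fiddle instead.

```python
import re

# Same regex as above, split over two lines for readability.
# Group 1 is the text after the prefix word, i.e. the part we care about.
PATTERN = re.compile(
    r"\b(?:in|and|is)\s+"
    r"((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)"
)

# Sample sentence taken from the question.
text = (
    "Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
    "staying in Hotel HomeStay [123] and I have no money."
)

# With exactly one capture group, findall returns the group-1 strings.
matches = PATTERN.findall(text)
print(matches)  # ['Hotel HomeStay [123]']
```

The earlier "in" and "and" prefixes do not produce a match because the run of characters after them hits another prefix word before reaching HomeStay.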

    If you just want the last occurrence, you can add another negative lookahead at the end (fiddle).

    \b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)(?!.*HomeStay.*)
    

    Same as above, except the matched text must not be followed by any text containing HomeStay.
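
To see the effect of the trailing lookahead, the sketch below applies the last-occurrence regex to a made-up sentence with two HomeStay mentions. The sentence and the names LAST_PATTERN, text and last are assumptions for illustration only; Python's re module is used as one possible engine.

```python
import re

# Last-occurrence variant: the final (?!.*HomeStay.*) rejects any match
# that still has another HomeStay somewhere after it.
LAST_PATTERN = re.compile(
    r"\b(?:in|and|is)\s+"
    r"((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)"
    r"(?!.*HomeStay.*)"
)

# Hypothetical input with two candidate matches.
text = (
    "I slept in Hotel HomeStay [123] and then moved "
    "in Motel HomeStay [456] and left."
)

last = LAST_PATTERN.search(text)
print(last.group(1))  # Motel HomeStay [456]
```

One caveat when trying this in Python: the dot does not match a newline by default, so on multi-line input the final lookahead only inspects the rest of the current line unless re.DOTALL is passed.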

    Finally, if the matching text has to contain at least a word from a list, just replace both occurrences of HomeStay with a list of alternatives. Example for HomeStay and Luxury: (?:HomeStay|Luxury) (fiddle).
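
If the prefix words and the required keywords vary, the pattern can be assembled from two lists rather than edited by hand. The following is a rough sketch under that assumption; build_last_match_pattern is a hypothetical helper, not something from the original answer, and the sample sentence is made up.

```python
import re

def build_last_match_pattern(prefixes, keywords):
    """Hypothetical helper: build the last-occurrence regex from word lists."""
    # Escape each word so that regex metacharacters in it are taken literally.
    prefix_alt = "|".join(re.escape(p) for p in prefixes)
    keyword_alt = "|".join(re.escape(k) for k in keywords)
    # Zero or more characters, none of them directly followed by a prefix word.
    no_prefix_ahead = rf"(?:.(?!\b(?:{prefix_alt})\b))*"
    return re.compile(
        rf"\b(?:{prefix_alt})\s+"
        rf"({no_prefix_ahead}(?:{keyword_alt}){no_prefix_ahead})"
        rf"(?!.*(?:{keyword_alt}).*)"
    )

pattern = build_last_match_pattern(["in", "and", "is"], ["HomeStay", "LuxuryInn"])

# Made-up sentence containing one match for each keyword.
text = "We stayed in Hotel HomeStay [123] and dined in LuxuryInn (5 stars) and slept."

result = pattern.search(text)
print(result.group(1))  # LuxuryInn (5 stars)
```

Both occurrences of the keyword are replaced by the same alternation, as described above, so the trailing lookahead rejects a match whenever any of the listed keywords appears later in the text.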