Tags: regex, capturing-group, negative-lookbehind

Regex capturing the first occurrence of every group in a recurring pattern


Suppose I have the following text:

Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity

I have a regex (a bit more complex, but it boils down to this):

^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$

which has three capturing groups that capture the values of Name, Address and City (if they occur in the text). A few more examples are here: https://regex101.com/r/37nemH/6. EDIT: The ordering is not fixed beforehand, and the fields may not be separated by \t characters.

This all works well. The only slight problem is when one field occurs twice in the same text, as can be seen in the last example I put on regex101:

Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address
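
To make the problem concrete, here is a minimal reproduction, assuming Python's built-in re module (the question does not say which engine is used, and other flavors may treat repeated groups differently). A capturing group inside a repeated group keeps only the value from its last iteration, so group 2 ends up holding the second address:

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$"
)

# The example text with real tab characters between the fields.
text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"

match = PATTERN.match(text)
groups = match.groups()

# Group 2 was overwritten by the last "Address:" iteration.
print(groups)  # ('John Doe', 'Other Address', 'MyCity')
```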

What I would want is for the second capturing group to match the first address, i.e. Street 123 ABC, and preferably to let the second occurrence be matched within the "City" group, i.e.

1: John Doe
2: Street 123 ABC
3: MyCity\tAddress: Other Address

Conceptually, I tried doing this with a negative lookbehind, e.g. replacing (?:Address: (.+?)) with (?:(?<!.*Address: )Address: (.+?)), i.e. ensuring that an Address: match is not preceded somewhere in the text by another Address: tag. But negative lookbehind does not allow patterns of arbitrary length, so this would not work.
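
As an illustration of that limitation, the sketch below tries to compile the lookbehind variant with Python's built-in re module. This is an assumption about the engine: some other flavors do accept variable-length lookbehind, so the result is specific to re.

```python
import re

# Attempt to compile the variable-length negative lookbehind from the question.
try:
    re.compile(r"(?<!.*Address: )Address: (.+?)")
    error_message = None
except re.error as exc:
    # Python's re only accepts lookbehind patterns of fixed width.
    error_message = str(exc)

print(error_message)
```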

Can this be achieved using regex, and how?


Solution

  • If the fields can appear in any order and some or all of them can be missing, it is much easier to use 3 separate patterns to extract the bits you need.

    Name (demo):

    ^.*?Name:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
    

    City (demo):

    ^.*?City:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
    

    Address (demo):

    ^.*?Address:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
    

    Details

    • ^ - start of string
    • .*? - any 0+ chars other than line break chars, as few as possible
    • Address: - a keyword to stop at and look for the expected match
    • \s* - 0+ whitespaces
    • (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible...
    • (?=\s*(?:Name:|Address:|City:|$)) - ...up to, but excluding, 0 or more whitespaces followed by Name:, Address:, City: or the end of the string (a positive lookahead, so this part is not consumed).
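
The three patterns above differ only in the keyword, so they can be generated from one template. The following is a minimal sketch in Python, assuming the built-in re module; the helper name first_value and the KEYS tuple are made up for this example and are not part of the original answer.

```python
import re

# Hypothetical list of field names; extend it if more keywords exist.
KEYS = ("Name", "Address", "City")

# Alternation of all keywords, used in the lookahead as stop markers.
STOP = "|".join(f"{key}:" for key in KEYS)


def first_value(key, text):
    """Return the value after the first occurrence of `key:`, or None."""
    pattern = rf"^.*?{key}:\s*(.*?)(?=\s*(?:{STOP}|$))"
    m = re.search(pattern, text)
    return m.group(1) if m else None


text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"
result = {key: first_value(key, text) for key in KEYS}
print(result)
# {'Name': 'John Doe', 'Address': 'Street 123 ABC', 'City': 'MyCity'}
```

Because ^.*? is lazy, each pattern stops at the first occurrence of its keyword, so Address yields Street 123 ABC. Note that with this approach the repeated Address: Other Address is simply ignored rather than folded into the City value, and a missing field yields None.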