
Regex to capture everything after optional token


I have fields which contain data in the following possible formats (each line is a different possibility):

AAA - Something Here  
AAA - Something Here - D  
Something Here 

Note that the first group of letters (AAA) can be of varying lengths.

What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:

- (.*) which works fine for cases 1 and 2 but obviously not 3;

(?<= - )(.*) which also works fine for cases 1 and 2;

(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.

I feel like I'm on the verge of it but I can't seem to crack it.

Thanks in advance for your help.

Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.


Solution

  • About the patterns that you tried:

    • - (.*) This pattern matches the first occurrence of " - " and then the rest of the line. It matches too much for the second example, as the .* also runs past the second occurrence of " - ".
    • (?<= - )(.*) This pattern matches the same text as the first one, but without the " - " in the match itself, because the lookbehind only asserts that " - " occurs directly to the left.
    • (?! - )(.+)| - (.+) This pattern starts with a negative lookahead (?! - ), which asserts that what is directly to the right is not " - ". As none of the examples start with " - ", the .+ after the lookahead matches the whole line, and the second alternative after the | is not evaluated.
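The behaviour described above can be checked with a small sketch. It uses Python's built-in re module rather than PCRE, which should behave the same way for these three attempts. The leading space in the first pattern and the helper name first_group are assumptions made for illustration.

```python
import re

# The three example lines from the question, without trailing spaces.
lines = ["AAA - Something Here", "AAA - Something Here - D", "Something Here"]

def first_group(pattern, text):
    """Hypothetical helper: return the first capture group that took part in the match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return next((g for g in m.groups() if g is not None), None)

# Attempt 1: needs a literal " - ", so the third line cannot match.
attempt_1 = [first_group(r" - (.*)", s) for s in lines]

# Attempt 2: same captured text, the " - " is only asserted, not consumed.
attempt_2 = [first_group(r"(?<= - )(.*)", s) for s in lines]

# Attempt 3: the lookahead succeeds at position 0, so .+ takes the whole line.
attempt_3 = [first_group(r"(?! - )(.+)| - (.+)", s) for s in lines]

for name, result in [("1", attempt_1), ("2", attempt_2), ("3", attempt_3)]:
    print(name, result)
```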

    If the first group of letters can be of varying length, you could make the match more specific with 1 or more uppercase characters [A-Z]+, or 1 or more word characters \w+.

    For a broader match, you could match 1 or more non-whitespace characters using \S+:

    ^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
    

    Explanation

    • ^ Start of string
    • (?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
    • \K Clear the match buffer (Forget what is currently matched)
    • \S+ Match 1+ non whitespace characters
    • (?: Non capture group
      • \h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
      • \S+ Match 1+ non whitespace chars
    • )* Close non capture group and repeat 0+ times to match more "words" separated by spaces
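For readers who want to try the idea outside PCRE, here is a rough Python translation. Python's re module supports neither \K nor \h, so this sketch uses a capture group in place of \K and [ \t] in place of \h (which is narrower than PCRE's \h, since it leaves out other Unicode spaces). The names PATTERN and extract are hypothetical.

```python
import re

# Translation of ^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)* for Python's re:
#   \K  -> the wanted text is placed in capture group 1 instead
#   \h  -> [ \t] (assumption: only spaces and tabs occur in the data)
PATTERN = re.compile(r"^(?:\S+[ \t]-[ \t])?(\S+(?:[ \t](?!-[ \t])\S+)*)")

def extract(text):
    """Return the words after the optional 'AAA - ' prefix, stopping before ' - '."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

for sample in ["AAA - Something Here", "AAA - Something Here - D", "Something Here"]:
    print(repr(sample), "->", repr(extract(sample)))
```

All three sample lines should give "Something Here" here, because this first pattern deliberately stops before a following " - ".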


    Edit

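    To also match an optional hyphen and a trailing single character, you could append \h*(?:-\h\S\h*)?$ to the pattern: optional horizontal whitespace, then an optional non capturing group for the hyphen and the single character, then $ to assert the end of the string if the pattern should match the whole string: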

    ^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
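The extended pattern can be sketched in Python in the same way, again as a translation and not the PCRE pattern itself: a capture group stands in for \K and [ \t] for \h. The names FULL and extract_with_suffix are hypothetical, and the inputs are assumed to have no trailing whitespace (any trailing spaces would end up inside the captured text).

```python
import re

# Translation of ^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
# Group 1 holds what PCRE would return as the match after \K.
FULL = re.compile(
    r"^(?:\S+[ \t]-[ \t])?"          # optional leading token followed by " - "
    r"(\S+(?:[ \t](?!-[ \t])\S+)*"   # words, stopping before a " - "
    r"[ \t]*(?:-[ \t]\S[ \t]*)?)$"   # optional "- X" with a single character X
)

def extract_with_suffix(text):
    """Return the captured text, or None if the whole line does not fit the pattern."""
    m = FULL.match(text)
    return m.group(1) if m else None

for sample in ["AAA - Something Here", "AAA - Something Here - D", "Something Here"]:
    print(repr(sample), "->", repr(extract_with_suffix(sample)))
```

Because of the $ anchor, a line whose trailing part is longer than one character, such as "AAA - Something Here - DD", should not match at all in this sketch.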
                                           
    
