I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*)
which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*)
which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+)
works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.
About the patterns that you tried:
- (.*)
This pattern matches the first occurrence of - and then the rest of the line. It matches too much for the second example, as the .* also matches the second occurrence of -.
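As a quick check of that behaviour, here is a minimal Python sketch that runs the first attempt against the three sample lines. The pattern is copied as written above; the helper name `capture` is only illustrative.

```python
import re

# The first attempted pattern, copied as written above.
FIRST_ATTEMPT = re.compile(r"- (.*)")

def capture(line):
    """Return group 1 of the first match, or None when nothing matches."""
    match = FIRST_ATTEMPT.search(line)
    return match.group(1) if match else None

for line in ("AAA - Something Here", "AAA - Something Here - D", "Something Here"):
    print(repr(line), "->", repr(capture(line)))
```

The third line has no hyphen at all, so the search finds nothing there, which is why this attempt cannot cover case 3.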
(?<= - )(.*)
This pattern matches the same text as the first pattern but without the -, as the lookbehind only asserts that it occurs directly to the left.
(?! - )(.+)| - (.+)
This pattern uses a negative lookahead (?! - ), which asserts that what is directly to the right is not " - ". As none of the examples start with " - ", the whole line is matched directly after the negative lookahead by .+ and the second part after the alternation | is not evaluated.
If the first group of letters can be of varying length, you could make the match more specific by matching 1 or more uppercase characters with [A-Z]+, or 1 or more word characters with \w+.
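To make the difference between those two character classes concrete, here is a small Python sketch. The prefixes in the list are hypothetical and chosen only to show where the classes differ; `accepts` is an illustrative helper, not part of any library.

```python
import re

# Hypothetical prefixes, chosen only to show where the two classes differ.
prefixes = ["AAA", "Ab1", "A-B"]

def accepts(pattern, text):
    """True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

for prefix in prefixes:
    print(prefix, "[A-Z]+:", accepts(r"[A-Z]+", prefix), r"\w+:", accepts(r"\w+", prefix))
```

[A-Z]+ only accepts runs of uppercase letters, while \w+ also accepts lowercase letters, digits and underscores. Neither accepts a prefix that itself contains a hyphen.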
For a broader match, you could match 1 or more non-whitespace characters using \S+:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
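The pattern above relies on \h and \K, which PCRE supports but Python's built-in re module does not. For anyone who wants to try the idea outside PCRE, here is a rough adaptation, under two assumptions that are mine rather than part of the original pattern: \h is replaced by [ \t] (which covers fewer characters than \h), and \K is replaced by a capturing group around the part to keep. The name `extract_name` is illustrative.

```python
import re

# Adaptation for Python's re module (an assumption, not the PCRE original):
#   \h -> [ \t]  (space and tab only, narrower than PCRE's \h)
#   \K -> a capturing group around the part that should be kept
ADAPTED = re.compile(r"^(?:\S+[ \t]-[ \t])?(\S+(?:[ \t](?!-[ \t])\S+)*)")

def extract_name(line):
    """Return the captured text, or None when the line does not match."""
    match = ADAPTED.search(line)
    return match.group(1) if match else None

for line in ("AAA - Something Here", "AAA - Something Here - D", "Something Here"):
    print(repr(line), "->", repr(extract_name(line)))
```

In this form the capture stops before a trailing " - D", because the lookahead refuses to continue past a space that is followed by a hyphen and another space.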
Explanation
^  Start of string
(?:\S+\h-\h)?  Optionally match the first group of non-whitespace chars followed by - between horizontal whitespace chars
\K  Clear the match buffer (forget what has been matched so far)
\S+  Match 1+ non-whitespace characters
(?:  Non-capturing group
  \h(?!-\h)  Match a horizontal whitespace char and assert that what is directly to the right is not - followed by another horizontal whitespace char
  \S+  Match 1+ non-whitespace chars
)*  Close the non-capturing group and repeat it 0+ times to match more "words" separated by spaces
Edit
To match an optional hyphen and trailing single character, you could add an optional non-capturing group (?:-\h\S\h*)? and assert the end of the string with $ if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
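The same kind of adaptation can be used to try this final pattern in Python. As before, this is a sketch under my own assumptions rather than the PCRE original: [ \t] stands in for \h, a capturing group stands in for \K, and `extract_with_suffix` is an illustrative name.

```python
import re

# Adaptation for Python's re module (an assumption, not the PCRE original):
#   \h -> [ \t]  (space and tab only, narrower than PCRE's \h)
#   \K -> a capturing group around the part that should be kept
ADAPTED_FULL = re.compile(
    r"^(?:\S+[ \t]-[ \t])?"              # optional leading "AAA - " prefix
    r"(\S+(?:[ \t](?!-[ \t])\S+)*"       # the words that make up the name
    r"[ \t]*(?:-[ \t]\S[ \t]*)?)$"       # optional trailing "- D" (single character)
)

def extract_with_suffix(line):
    """Return the captured text, or None when the line does not match."""
    match = ADAPTED_FULL.search(line)
    return match.group(1) if match else None

for line in ("AAA - Something Here", "AAA - Something Here - D", "Something Here"):
    print(repr(line), "->", repr(extract_with_suffix(line)))
```

Because the pattern is anchored with $, a line whose trailing part is longer than one character (for example "AAA - Something Here - DD") is rejected as a whole rather than partially matched.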