I have a list of strings that I need to filter using regex. Some of the strings may contain URLs in the form of '(random_chars).(random_chars).(random_chars).(random_chars)...' etc.
I am trying to create a regex that will find such URLs but ignore URLs where the first set of (random_chars) is 'java'.
For example the strings below:
"test string (test.url.com) abcdef java.lang.Assertion uvwxyz www.google.com abcdef"
I'd expect it to match test.url.com and www.google.com but not java.lang.Assertion
"another test string /abc/xyz/lib/def/GH.tr test 200."
I wouldn't want it to match GH.tr
My current regex will match the below:
This is my current regex, and I have attempted to use a negative lookahead:
(?!java)(?:(?:\w+\.)+[\w]+)
What have I missed with my regex?
You get those matches because the negative lookahead (?!java) only asserts that what is directly to the right of the current position is not java.
That assertion fails at the position right before java.lang.Assertion, so no match starts there.
But the regex engine then moves one position forward, past the j, where the assertion succeeds because what is on the right is now ava.lang.Assertion, so that part does match.
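To see this behaviour in practice, here is a minimal sketch in Python. The choice of Python's built-in re module is an assumption, since the question does not name a regex engine. It runs the pattern from the question against the first example string and records where each match starts.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(?!java)(?:(?:\w+\.)+[\w]+)")

text = ("test string (test.url.com) abcdef java.lang.Assertion "
        "uvwxyz www.google.com abcdef")

# Record the start offset and text of every match, so it is visible
# that one match begins one character after the "j" of "java".
matches = [(m.start(), m.group()) for m in pattern.finditer(text)]

for start, value in matches:
    print(start, value)

# The matched values are:
# test.url.com, ava.lang.Assertion, www.google.com
```

The lookahead only blocks a match from starting exactly at the j. It does not prevent a match from starting at the next character.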
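One option could be to first match what you don't want to keep and discard it using (*SKIP)(*FAIL) (backtracking control verbs available in PCRE-style engines), then match what you want to keep in the other side of an alternation: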
\bjava(?:\.\w+)+(*SKIP)(*FAIL)|(?<!/)\b\w+(?:\.\w+)+
That will match:
\bjava(?:\.\w+)+(*SKIP)(*FAIL)
  Pattern to match what you don't want to keep
|
  Or
(?<!/)
  Negative lookbehind, assert that what is directly to the left is not a forward slash
\b\w+(?:\.\w+)+
  Pattern that you want to match, starting at a word boundary
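Python's built-in re module does not support (*SKIP)(*FAIL), so the sketch below swaps in a different but related technique: the unwanted pattern is matched by the left side of the alternation, the wanted pattern is captured in group 1 on the right side, and only matches where group 1 took part are kept. The function name find_urls is hypothetical and used only for illustration.

```python
import re

# Left branch: consume java.xxx.yyy so it cannot be matched piecemeal.
# Right branch: capture a dotted name that is not preceded by "/".
pattern = re.compile(r"\bjava(?:\.\w+)+|(?<!/)\b(\w+(?:\.\w+)+)")


def find_urls(text):
    """Return only the matches produced by the capturing (right) branch."""
    return [m.group(1) for m in pattern.finditer(text)
            if m.group(1) is not None]


samples = [
    "test string (test.url.com) abcdef java.lang.Assertion "
    "uvwxyz www.google.com abcdef",
    "another test string /abc/xyz/lib/def/GH.tr test 200.",
]

for sample in samples:
    print(find_urls(sample))

# prints ['test.url.com', 'www.google.com']
# prints []
```

Because the left branch consumes the whole of java.lang.Assertion, the engine resumes scanning after it and never gets the chance to match ava.lang.Assertion.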