Every topic I've read combining Python's Regex (re library) and Inverse/Negative matching has focused on multiline strings as opposed to SINGLE line strings.
http://www.regextester.com/15 uses a JavaScript regex engine that displays every match (the /g flag) and behaves differently from Python's re library (according to https://rexegg.com/ there is another regex library for Python, which I don't wish to use just yet). That aside, I wanted to know whether there is a way to use "re.findall" (or re.search, although I'm partial to re.findall) to do 2 things:
1. Return all individual strings that do not contain the string "hede" in qw below.
2. Return all individual strings that do not contain the string "hede", and split the strings that do contain "hede" on either side of it.
>>> qw = "hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld"
Scenario 1 Desired Output (exclude all strings that contain "hede"):
>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall('{SOMETHING_THAT_EXCLUDES_ALL_STRINGS_CONTAINING_hede}', qw)
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']
Scenario 2 Desired Output (include everything that doesn't contain "hede" and break strings containing "hede" at "hede"):
>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall('{SOMETHING_THAT_INCLUDES_ALL_STRINGS_NOT_CONTAINING_hede_AND_BREAKS_THEM_IF_THEY_DO}', qw)
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']
The closest I've come is this, which is inefficient and still misses most of the terms:
>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> re.findall(r'[\S]+(?=hede)|(?<=hede )[\S]+|(?<=hede)[\S]+|[\S]+(?= hede)|[\S]+(?=hede )|(?<= hede)[\S]+', qw)
['haha', 'rara', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'b', 'kdjkdld']
Keep in mind that qw features a single space between the terms. I couldn't help but wonder whether a solution would still be possible if the spacing varied, i.e. if qw had looked like this:
>>> qw = "hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld"
Thank you guys for all of the help.
Also, in every thread I've read a variation on "^(?!hede).*$" or "^(?!.foo)." has come up for multiline posts. This doesn't work well in Python of course, but I've tried fooling around with these to no avail.
Thank you guys so much for the help!
I suggest leveraging the re.findall feature of returning only captured texts:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, you can match and capture what you need and just match what you need to skip. See the Python demo:
import re
qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
rx = r'hede|((?:(?!hede)\S)+)'
results = re.findall(rx, qw)
print(list(filter(None, results)))  # list() is needed in Python 3, where filter() returns an iterator
# => ['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']
Since hede is not captured, it is not returned. However, the pattern contains one capturing group, and that group does not participate in the match whenever the plain hede alternative matches, so an empty string is added to the resulting list each time that happens. That is why the results are filtered.
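To make the empty strings visible, here is a minimal sketch on a shortened input (the `sample` string and the variable names are made up for illustration, they are not part of the question):

```python
import re

rx = r'hede|((?:(?!hede)\S)+)'

# Hypothetical shortened input, used only to keep the output readable.
sample = 'haha hede rara xhedey'

# Each time the bare "hede" alternative matches, Group 1 does not
# participate, so re.findall reports an empty string for that match.
raw = re.findall(rx, sample)
print(raw)   # ['haha', '', 'rara', 'x', '', 'y']

# Dropping the empty strings leaves only the captured pieces.
kept = [s for s in raw if s]
print(kept)  # ['haha', 'rara', 'x', 'y']
```

The list comprehension is equivalent to `list(filter(None, raw))`; either form works in Python 3.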
Pattern details
- hede - match hede
- | - or
- ((?:(?!hede)\S)+) - match and capture into Group 1 one or more non-whitespace chars that are not the starting point for a hede sequence.

Note that if you use the PyPI regex module, you may use the PCRE-like verbs (*SKIP)(*F):
>>> import regex
>>> qw ='hoho hihi haha hede rara a rere titi so whdhdskhdshede wekjewhkwqjhededjfjfj so kjkfdjkdnekjdhide b hede kdjkdld'
>>> print(regex.findall(r'hede(*SKIP)(*F)|((?:(?!hede)\S)+)', qw))
['hoho', 'hihi', 'haha', 'rara', 'a', 'rere', 'titi', 'so', 'whdhdskhds', 'wekjewhkwqj', 'djfjfj', 'so', 'kjkfdjkdnekjdhide', 'b', 'kdjkdld']
Then, there is no need to filter the results.
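The patterns above cover Scenario 2. For Scenario 1 (drop every term that contains hede), and for the question about uneven spacing, the following is a rough sketch with plain re rather than a definitive answer. The `qw_spaced` string and the variable names are assumptions made up for this example. The idea is to anchor the same tempered token on both sides with whitespace boundaries, so a term is returned only if it can be consumed in full.

```python
import re

# Hypothetical input with uneven whitespace (two spaces, a tab, three spaces).
qw_spaced = 'hoho  hihi\thede   whdhdskhdshede so'

# Scenario 1: (?<!\S) and (?!\S) force the match to span a whole term,
# so any term containing "hede" cannot match at all.
whole = re.findall(r'(?<!\S)(?:(?!hede)\S)+(?!\S)', qw_spaced)
print(whole)   # ['hoho', 'hihi', 'so']

# The same result without a regex, for comparison.
plain = [w for w in qw_spaced.split() if 'hede' not in w]
print(plain)   # ['hoho', 'hihi', 'so']

# Scenario 2 on the same input: the pattern is built on \S, so the
# amount or kind of whitespace between terms does not matter.
pieces = [s for s in re.findall(r'hede|((?:(?!hede)\S)+)', qw_spaced) if s]
print(pieces)  # ['hoho', 'hihi', 'whdhdskhds', 'so']
```

Because both patterns only ever look at non-whitespace characters, they do not depend on the terms being separated by exactly one space.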