I've implemented a non-greedy regex on a group of string URLs, where I'm trying to clean them up so that they end after the .com (.co.uk etc). Some of them continued with ', ", or < after the desired cutoff, and so I used x = re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', finalSoup2).
The problem is that some URLs are [email protected]'misc''misc' (or similar with < >) and so after implementing the non-greedy regex I'm still left with [email protected]">[email protected], for example.
I've tried two ?? quantifiers together, but that obviously isn't working, so what's the proper way to achieve clean URLs in this situation?
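The issue with your regex is that \S matches any non-whitespace character, including ', ", and <. The greedy \S+ before .co can therefore run straight past the first delimiter, and making only the trailing \S*? lazy does not stop it. Instead of matching non-spaces, match any run of characters that are not one of your delimiters.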
So in this case you could get away with the following regex based on the information above.
>>> import re
>>> finalSoup2 = """
... [email protected]'misc''misc
... [email protected]">[email protected]
... google.com
... google.co.uk"'<>Stuff
... """
>>> x = re.findall('([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
>>> x
['[email protected]',
'[email protected]',
'[email protected]\ngoogle.com\ngoogle.co.uk']
You can then use these to get the URLs you want, but make sure to split each match on '\n' (for example with str.splitlines()), since the negated character class also matches newlines and a match may span several lines, as in the last item above.
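As a possible alternative to splitting afterwards, here is a minimal sketch that adds \s to the negated character class so a match cannot cross a newline in the first place. The addresses in `sample` are made up for illustration (the originals are not shown above), and the trailing delimiter group is dropped on the assumption that stopping at the first quote, angle bracket, or whitespace is enough.

```python
import re

# Hypothetical sample text; these addresses are invented for illustration.
sample = """
alice@example.com'misc''misc
bob@example.co.uk">carol@example.org
google.com
google.co.uk"'<>Stuff
"""

# Negated class: stop at the first quote, angle bracket, or whitespace.
# Because \s is excluded, a match can never span a newline.
pattern = re.compile(r"""([A-Za-z0-9]+@[^\s'"<>]+)""")

addresses = pattern.findall(sample)
print(addresses)
```

Lines without an @ (such as google.com here) are simply skipped, so no post-processing split is needed under these assumptions.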