Why is non-greedy Python Regex not non-greedy enough?


I've applied a non-greedy regex to a group of URL strings that I'm trying to clean up so that they end after the .com (or .co.uk, etc.). Some of them continue with ' or " or < after the desired cutoff, so I used x = re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', finalSoup2).

The problem is that some URLs look like [email protected]'misc''misc' (or similar, with < >), so even after applying the non-greedy regex I'm still left with, for example, [email protected]">[email protected].

I've tried two ??'s together, but that obviously doesn't work, so what's the proper way to achieve clean URLs in this situation?


Solution

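  • The issue with your regex is that the part before the cutoff is \S+.co: non-spaces, any single character (the period is unescaped), then co. The \S+ is greedy and also matches ', " and <, so it can run past the first delimiter to a later .co; the non-greedy \S*? only limits what is matched after that .co.

    Based on the information above, you could get away with a regex that instead matches everything up to the first ', ", < or >: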

    >>> import re
    >>> finalSoup2 = """
    ... [email protected]'misc''misc
    ... [email protected]">[email protected]
    ... google.com
    ... google.co.uk"'<>Stuff
    ... """
    >>> x = re.findall('([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
    >>> x
    ['[email protected]',
     '[email protected]',
     '[email protected]\ngoogle.com\ngoogle.co.uk']
    

    You can then use these matches to get the URLs you want, but you have to split each one on the newline character '\n' (for example with str.splitlines()), because [^'"<>]+ also matches newlines, as the third result above shows.
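
The behaviour described above can be reproduced with a short sketch. The sample text and the example.com addresses are hypothetical placeholders, not the asker's data. The last pattern is a variation that is not part of the answer: it assumes the addresses never contain whitespace, so it excludes whitespace as well and drops the trailing delimiter.

```python
import re

# Hypothetical sample text; these addresses are placeholders, not the asker's data.
sample = """
alice@example.com'misc''misc'
bob@example.co.uk">carol@example.com<
dave@example.com
google.co.uk"'<>Stuff
"""

# Pattern from the question: the greedy \S+ also matches quotes and brackets,
# so on the second line it runs on to the last ".co" before backtracking.
question_pattern = r"""([A-Za-z0-9]+@\S+.co\S*?)['"<]"""
overrun = re.findall(question_pattern, sample)

# Pattern from the answer: stop at the first quote or angle bracket.
# A match can still span a newline, so split each match on newlines
# and keep only the pieces that contain an "@".
answer_pattern = r"""([A-Za-z0-9]+@[^'"<>]+)['"<]"""
split_matches = [
    part
    for match in re.findall(answer_pattern, sample)
    for part in match.splitlines()
    if "@" in part
]

# Variation (an assumption, not from the answer): exclude whitespace too,
# so no match can cross a line break and no trailing delimiter is required.
variation_pattern = r"""([A-Za-z0-9]+@[^\s'"<>]+)"""
no_whitespace = re.findall(variation_pattern, sample)

print(overrun)
print(split_matches)
print(no_whitespace)
```

On this sample the question's pattern returns the second and third addresses glued together and never finds the address that has no delimiter after it on its line. The other two approaches both return the four addresses separately.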