Tags: python, regex, python-2.7, regex-group, regex-greedy

RegEx for matching URLs in Python


I have this example string:

line = '[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end'

I need to extract the path (without slashes) before "marker needle". The following works to list all paths:

print re.findall('https://www\\.myurl\\.com/(.+?)/', line)
# ['test1', 'test2', 'test3']

However, when I change it to find only the path I want (the one before "marker needle"), the output is unexpected:

print re.findall('https://www\\.myurl\\.com/(.+?)/ marker needle', line)
# ['test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3']

My expected output:

test3

I have tried the same with re.search but the result is the same.


Solution

  • This expression has three capturing groups; the second one captures the desired output:

    (https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)
    

    This tool helps us to modify/change the expression, if you wish.

    RegEx Descriptive Graph

    jex.im visualizes regular expressions.

    Python Test

    # -*- coding: UTF-8 -*-
    import re
    
    string = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end"
    expression = r'(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
    match = re.search(expression, string)
    if match:
        print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚 ")
    else: 
        print('🙀 Sorry! No matches!')
    

    Output

    YAAAY! "test3" is a match 💚💚💚
    

    Performance Test

    This JavaScript snippet reports the runtime of a for loop that runs 1 million times.

    const repeat = 1000000;
    const start = Date.now();
    let match;

    for (let i = 0; i < repeat; i++) {
        // The dots are escaped so that "." matches only a literal dot.
        const regex = /(.*)(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)(.*)/gm;
        const str = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end";
        const subst = `$3`;

        // Replace the whole line with capture group 3 (the path segment).
        match = str.replace(regex, subst);
    }

    const end = Date.now() - start;
    console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
    console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
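
As a closing note on why the pattern in the question returns the long string: a lazy quantifier such as `(.+?)` does not make the regex engine look for the shortest possible match overall. The engine still starts at the leftmost position where the pattern can begin (the first URL) and extends the group until the rest of the pattern, `/ marker needle`, finally matches. The sketch below is one possible alternative, assuming a path segment never contains a slash; the variable names `lazy` and `segment` are only illustrative.

```python
import re

line = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Lazy (.+?) starts at the leftmost URL and keeps expanding until
# "/ marker needle" matches, so it runs across the earlier URLs.
lazy = re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line)

# [^/]+ cannot cross a slash (assumption: the path is a single segment),
# so only the URL directly before "marker needle" can match.
segment = re.findall(r'https://www\.myurl\.com/([^/]+)/ marker needle', line)

print(lazy)     # the long string from the question
print(segment)  # ['test3']
```

The character class `[A-Za-z0-9-]+` used in the answer above works for the same reason: it excludes slashes and spaces, so the group cannot stretch past a single path segment.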