Tags: python, regex, regex-lookarounds, regex-group, regex-greedy

RegEx for extracting domains and subdomains


I'm trying to strip a bunch of website URLs down to their domain names, e.g.:

https://www.facebook.org/hello 

becomes facebook.org.

I'm using this regex pattern:

(https?:\/\/)?([wW]{3}\.)?([\w]*.\w*)([\/\w]*)

This catches most cases but occasionally there will be websites such as:

http://www.xxxx.wordpress.com/hello

which I want to strip to xxxx.wordpress.com.

How can I identify those cases while still identifying all other normal entries?


Solution

  • Your expression seems to be working fine and outputs what you want. I only added the i flag and slightly modified it to:

    (https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)
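
To see how this pattern handles the subdomain case, it can help to list the individual matches. The sketch below reuses the pattern above unchanged; the variable names (`pattern`, `url`, `pieces`, `host`) are illustrative only. It suggests that a multi-label host is not captured by a single match, but is rebuilt from the third group of several consecutive matches during substitution.

```python
import re

# Same pattern as above; the i flag corresponds to re.IGNORECASE in Python.
pattern = re.compile(r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)", re.IGNORECASE)

url = "http://www.xxxx.wordpress.com/hello"

# Group 3 of each successive match; the substitution keeps exactly these pieces.
pieces = [m.group(3) for m in pattern.finditer(url)]
print(pieces)  # ['xxxx.wordpress', '.com']

# Replacing every match with its third group joins the pieces back together.
host = pattern.sub(r"\3", url)
print(host)  # xxxx.wordpress.com
```

Because the result depends on replacing every match, a single `re.search` or `re.match` with this pattern would return only the first piece (`xxxx.wordpress`) for this URL.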
    

    RegEx

    If this isn't the expression you wanted, you can modify and test it on regex101.com.

    RegEx Circuit

    You can also visualize your expressions in jex.im.

    Python Code

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
    import re
    
    regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"
    
    test_str = ("https://www.facebook.org/hello\n"
        "http://www.xxxx.wordpress.com/hello\n"
        "http://www.xxxx.yyy.zzz.wordpress.com/hello")
    
    subst = "\\3"
    
    # Limit the number of replacements by changing count (0 replaces every match)
    result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
    
    if result:
        print(result)
    
    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
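
If you would rather get the host from a single match per URL, one possible alternative is sketched below. Note that the pattern and the helper name `extract_host` are assumptions made for illustration and are not part of the expression above: it skips an optional scheme and an optional leading `www.`, then captures everything up to the first `/`, `:`, `?`, `#` or whitespace.

```python
import re

# Hypothetical alternative pattern (not the expression from the answer):
# optional scheme, optional "www.", then the host up to the first delimiter.
HOST_RE = re.compile(r"^(?:https?://)?(?:www\.)?([^/:?#\s]+)", re.IGNORECASE)


def extract_host(url):
    """Return the host without scheme or leading 'www.', or None if nothing matches."""
    match = HOST_RE.match(url.strip())
    return match.group(1) if match else None


urls = [
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
    "http://www.xxxx.yyy.zzz.wordpress.com/hello",
]

for url in urls:
    print(extract_host(url))
```

Under these assumptions, each URL is processed on its own, so the result does not rely on several matches being stitched together by the substitution.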
    

    JavaScript Demo

    const regex = /(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)/gmi;
    const str = `https://www.facebook.org/hello
    http://www.xxxx.wordpress.com/hello
    http://www.xxxx.yyy.zzz.wordpress.com/hello`;
    const subst = `$3`;
    
    // The substituted value will be contained in the result variable
    const result = str.replace(regex, subst);
    
    console.log('Substitution result: ', result);