Tags: python, regex, url, tor

Method for identifying .onion links in text?


How can I identify .onion links in a text, bearing in mind that they can appear in a variety of forms:

hfajlhfjkdsflkdsja.onion
http://hfajlhfjkdsflkdsja.onion
http://www.hfajlhfjkdsflkdsja.onion

I'm thinking of using a regex, but (.*?.onion) would return the whole paragraph in which the link is buried.
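To make the problem concrete, here is a minimal sketch of what that pattern does. The sample sentence and the name `naive` are made up for illustration. Because `.` also matches spaces, the lazy `.*?` still starts at the beginning of the line and swallows everything up to the link:

```python
import re

# Hypothetical sample text, not taken from real data.
text = "Visit http://abcdefexample.onion for details."

# '.' matches any character except a newline, including spaces,
# so the match begins at the start of the line rather than at the URL.
naive = re.search(r'(.*?.onion)', text)
print(naive.group(1))  # Visit http://abcdefexample.onion
```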


Solution

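  • This will do it: (?:https?://)?(?:www)?(\S*?\.onion)\b (non-capturing groups added; credit: @WiktorStribiżew). Because \S only matches non-whitespace characters, the match cannot spill over into the surrounding text.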

    Demo:

    import re

    s = '''hfajlhfjkdsflkdsja.onion
    https://hfajlhfjkdsflkdsja.onion
    http://www.hfajlhfjkdsflkdsja.onion
    https://www.google.com
    https://stackoverflow.com'''

    # \S*? cannot cross whitespace, so each match stays within a single URL
    for m in re.finditer(r'(?:https?://)?(?:www)?(\S*?\.onion)\b', s, re.M | re.IGNORECASE):
        print(m.group(0))  # group(0) is the full match, scheme included
    

    Output

    hfajlhfjkdsflkdsja.onion
    https://hfajlhfjkdsflkdsja.onion
    http://www.hfajlhfjkdsflkdsja.onion
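
If only the bare hostname is wanted rather than the full match, the capturing group can be used instead of group(0). One caveat with the pattern above: `(?:www)?` does not consume the dot after `www`, so for the `www` form group(1) would begin with a leading `.`. The sketch below is a small variation, not the answer's original pattern, that uses `(?:www\.)?` to avoid this. The names `ONION_HOST_RE` and `extract_onion_hosts` are hypothetical, chosen only for this example.

```python
import re

# Variation on the answer's pattern: 'www\.' also consumes the dot,
# so group(1) holds just the host part ending in '.onion'.
ONION_HOST_RE = re.compile(r'(?:https?://)?(?:www\.)?(\S*?\.onion)\b', re.IGNORECASE)


def extract_onion_hosts(text):
    """Return the .onion hostnames found in text, without scheme or 'www.'."""
    return [m.group(1) for m in ONION_HOST_RE.finditer(text)]


sample = '''hfajlhfjkdsflkdsja.onion
https://hfajlhfjkdsflkdsja.onion
http://www.hfajlhfjkdsflkdsja.onion
https://www.google.com
https://stackoverflow.com'''

for host in extract_onion_hosts(sample):
    print(host)
```

With the sample text from the demo, this prints `hfajlhfjkdsflkdsja.onion` three times, once for each of the three link forms. Note that the pattern does not check that the hostname is a valid onion address; it only looks for a run of non-whitespace characters ending in `.onion`.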