Tags: python, regex, string, python-re

RegEx for three tricky string patterns


I want to extract an Instagram username from a profile page.

The difficulty is that users write their usernames in different ways, which makes it tricky to capture the pattern with a RegEx.

These are the patterns I want to match (a user posts their Instagram username using one of them):

  • IG: @user-name
  • I.G.: @user-name
  • Instagram: @user-name

I came up with the logic below, but I got completely lost searching the RegEx documentation and examples for something that fits this search.

My logic: ignorecase (IG or I.G. or I.G or instagram) + (possible space) + (possible :) + (possible space) + (possible @) + (username with - or _ in it) + (ends with space or new line or full stop)

In short, I'd like to select the word (the username) after "instagram", "IG" or "I.G", excluding unnecessary characters such as ":", "@" or spaces.

How can I do this in RegEx? A one-liner would be an efficient yet elegant answer.

P.S. I want to do this with Python's re module.


Solution

  • My logic: ignorecase(IG or I.G. or I.G or instagram) + (possible space) + (possible :) + (possible space) + (possible @) + (username with - or _ in it) + (ends with space or new line or full stop)

    First, the prefix part (IG, I.G. or Instagram). Pass the re.I (or re.IGNORECASE) flag to re.compile to ignore case, then use | (alternation, i.e. "or") between the alternatives.

    r'(instagram|I\.?G\.?)'
    

    Each . is escaped so that it matches a literal dot, and it is followed by a question mark ?, which means "zero or one". The same ? goes after the possible space \s and the possible colon :.

    prefix = re.compile(r'(instagram|I\.?G\.?)\s?:?', re.IGNORECASE)
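
To make the prefix step concrete, here is a small sketch that runs the prefix pattern on its own, assuming the form in which each dot is optional (`\.?`). The sample strings are made up for illustration.

```python
import re

# Prefix only: "instagram" or "IG" with optional dots, then optional space and colon.
prefix = re.compile(r'(instagram|I\.?G\.?)\s?:?', re.IGNORECASE)

# Illustrative inputs (not taken from a real profile page).
samples = ['IG: @user-name', 'I.G.: @user-name', 'i.g @user-name', 'Instagram: @user-name']

# match() anchors at the start of each string; group(0) is the whole matched prefix.
matched = [prefix.match(s).group(0) for s in samples]
print(matched)
```

Note that for 'i.g @user-name' the matched text includes the trailing space, because `\s?` sits before the optional colon.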
    

    Next, the username. Put a question mark ? after @ to make it optional. The two (.*) groups are the first and (if present) last parts of the username, separated by an optional dash or underscore, (-|_)?.

    username = re.compile(r'@?(.*)(-|_)?(.*)\s?$')
    

    Putting it all together:

    username_regex = re.compile(r'^(instagram|I\.?G\.?)\s?:?\s?(@?.*((-|_).*)?\s?)$', re.IGNORECASE)
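
For readability, the same combined pattern can be spelled out with re.VERBOSE, which ignores whitespace inside the pattern and allows comments. This is only a sketch of how the pieces map to groups; the pattern itself is unchanged.

```python
import re

username_regex = re.compile(r'''
    ^(instagram|I\.?G\.?)   # group 1: prefix, "instagram" or IG with optional dots
    \s?:?\s?                # optional space, optional colon, optional space
    (                       # group 2: everything after the prefix
        @?.*                #   optional @, then any characters (greedy)
        ((-|_).*)?          #   groups 3 and 4: optional dash/underscore tail
        \s?                 #   optional trailing whitespace
    )$
''', re.IGNORECASE | re.VERBOSE)

m = username_regex.match('I.G: @first-last')
print(m.groups())
```

Because `.*` is greedy, it already consumes the dash or underscore, so groups 3 and 4 come back as None for this input.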
    

    I ran some tests against this regex; here is the code.

    import re
    
    username_regex = re.compile(r'^(instagram|I\.?G\.?)\s?:?\s?(@?.*((-|_).*)?\s?)$', re.IGNORECASE)
    
    tests = [
        'I.G.: @first-last',
        'I.G: @first-last',
        'I.g: @first-last',
        'I.g.: @first-last',
        'i.G: @first-last',
        'i.G.: @n-last',
        'i.g: @first-last',
        'i.g. @first-last',
        'I.G.:@first-last',
        'I.G@first-last',
        'I.g @first-last',
        'I.gfirst-last',
        'i.G: first_last',
        'i.G. first_last',
        'ig: first_last',
        'i.g. @first-last',
        'inStagram: @first-last',
        'instAgram: @first-last',
        'INSTAGRAM: @first-last',
    ]
    
    not_matched = 0
    for test in tests:
        searched = username_regex.search(test)
    
        if searched:
            print("MATCH ->", test)
            print(searched.group(), '\n\n')
        else:
            print("========", test)
            not_matched += 1
    
    print(not_matched)
    # >> 0
    

    If you want to get the prefix and the username separately, use the match object's group() and groups() methods. For example:

    searched = username_regex.search('I.G: @first-last')
    
    searched.groups()
    # ('I.G', '@first-last', None, None)
    
    searched.group(0) # 'I.G: @first-last'
    
    # If you want to get the prefix (the colon is outside the group)
    searched.group(1) # 'I.G'
    
    # If you want to get the username (the @ is included)
    searched.group(2) # '@first-last'
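
Since the question asks for the username without ":", "@" or spaces, and for searching within a profile page rather than a single line, a tighter variant might look like the sketch below. The names IG_HANDLE and find_usernames are hypothetical, and the allowed username characters (letters, digits, underscore, dash, and dots inside the name) are an assumption, not something confirmed above.

```python
import re

# Hypothetical tighter pattern; the username character set is an assumption.
IG_HANDLE = re.compile(
    r'\b(?:instagram|i\.?g\.?)'                # prefix, not captured
    r'(?=[\s:@])'                              # must be followed by space, colon or @
    r'\s*:?\s*@?'                              # optional separators, not captured
    r'(?P<username>[\w-]+(?:\.[\w-]+)*)',      # the name, without a trailing full stop
    re.IGNORECASE,
)

def find_usernames(text):
    """Return every username that follows an IG / I.G. / Instagram prefix in text."""
    return [m.group('username') for m in IG_HANDLE.finditer(text)]

# Illustrative profile text (made up for this example).
bio = "Photographer.\nIG: @first-last\nInstagram: second_name."
print(find_usernames(bio))
```

Unlike the pattern above, this one rejects input such as 'I.gfirst-last', where nothing separates the prefix from the name. It can still produce false positives on ordinary prose like "instagram is great", so the results may need a sanity check.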
    

    NOTE: I may have made a mistake somewhere; if you spot anything wrong, please let me know. Thanks.