Tags: python, regex, url, python-re

Problem using re.search for URL detection in Python


I'm using re.search to replace URLs in strings with the placeholder '{url_object}'. This is my code:

import re

def url_detector(text):
    urls = re.findall(r"(https?\:\/\/[\w\d:#@%\/;$()~_?\+-=\\\.&]*)", text)
    if len(urls)>0:
        for url in urls:
            span = re.search(url, text).span()
            text = text[:(span[0])] + '{url_object}' + text[span[1]:]
    return text

The text with URLs that I used is as follows:

text_list = ["google url is https://www.google.com/. Everyone frequently uses it",
             "https://www.google.com/search?q=simple+search&oq=simple+search&aqs=chrome..69i57j0l9.2908j0j7&sourceid=chrome&ie=UTF-8 is the url for simplesearch",
            "url for today's news : https://www.google.com/search?q=news+today&sxsrf=ALeKk00r1fVK6JeIaO1bhigZSu8IEGjgQw%3A1617353154494&ei=wtlmYIrXHdXez7sP6v-XwAw&oq=news+today&gs_lcp=Cgdnd3Mtd2l6EAMyCggAELEDEIMBEEMyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATICCAA6BwgAEEcQsAM6BwgAELADEEM6CgguELADEMgDEEM6BAgAEEM6BwgAELEDEEM6BQgAELEDSgUIOBIBMVDUBliRFGCzGGgBcAJ4AIABnAGIAacHkgEDMC43mAEAoAEBqgEHZ3dzLXdpesgBC8ABAQ&sclient=gws-wiz&ved=0ahUKEwiKwP-Blt_vAhVV73MBHer_BcgQ4dUDCA0&uact=5, date = 02/04/2021",
            "sample url = https://www.google.com/search?q=sample&sxsrf=ALeKk02uixAiZMyqhMtSZZwbeYefHRutGQ%3A1617353222151&ei=BtpmYKfTCJKD4-EPlLWVeA&oq=sample&gs_lcp=Cgdnd3Mtd2l6EAMyBAgjECcyBQgAELEDMgUIABCxAzIECAAQQzIFCAAQsQMyBAgAEEMyAggAMgUIABCxAzIFCAAQsQMyBQgAELEDOgcIABBHELADOgcIABCwAxBDOgcIABCHAhAUUKERWN4UYMIYaAFwAngAgAGWAYgB9ASSAQMwLjWYAQCgAQGqAQdnd3Mtd2l6yAEKwAEB&sclient=gws-wiz&ved=0ahUKEwin7qCilt_vAhWSwTgGHZRaBQ8Q4dUDCA0&uact=5"]

I ran url_detector on the list above:

for text in text_list:
    print(url_detector(text))

I expected output like this:

google url is {url_object}. Everyone frequently uses it
{url_object} is the url for simplesearch
url for today's news : {url_object}, date = 02/04/2021
sample url = {url_object}

I got this:

google url is {url_object}. Everyone frequently uses it
'NoneType' object has no attribute 'span'

This seems to happen because of the '?' in the URLs returned by re.findall.

Presumably re is treating '?' as a special character. To work around that, I tried replacing '?' with '\?', but the '?' ends up replaced with '\\?'. Using such a pattern with re.search() raises this error:

error: bad escape (end of pattern) at position 29.

Any ideas on how I could solve this? Thanks in advance.


Solution

  • Wrap url in re.escape() in the statement span = re.search(url, text).span(), like this:

    span = re.search(re.escape(url), text).span()
    

    The URLs extracted by the first re.findall() contain characters such as ? that the regex engine treats as metacharacters rather than literal text. In a pattern, ? makes the preceding character optional, so search?q matches "searcq" or "searchq" but never the literal text "search?q". The second re.search() therefore finds no match and returns None, and calling .span() on None raises the error you see.

    From the Python documentation for re.escape():

    Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
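
To make the explanation above concrete, here is a minimal sketch that first reproduces the failed lookup on a shortened URL and then applies the re.escape() fix to the function from the question. The shortened sample strings, the URL_PATTERN constant name, the placeholder parameter, and the None guard are additions for illustration, not part of the original code; the regex itself is copied unchanged from the question.

```python
import re

# Shortened sample data (an assumption made for illustration).
url = "https://www.google.com/search?q=news"
text = "see https://www.google.com/search?q=news today"

# Unescaped: "h?" is read as "optional h", so the literal "?" in the text
# can never be matched and the search fails.
unescaped_result = re.search(url, text)

# Escaped: every metacharacter is matched literally, so the URL is found.
escaped_result = re.search(re.escape(url), text)

# Pattern copied unchanged from the question.
URL_PATTERN = r"(https?\:\/\/[\w\d:#@%\/;$()~_?\+-=\\\.&]*)"


def url_detector(text, placeholder="{url_object}"):
    """Replace each URL found in text with the placeholder."""
    for found in re.findall(URL_PATTERN, text):
        # re.escape() makes the second lookup a literal-string search.
        match = re.search(re.escape(found), text)
        if match is None:
            # Defensive guard (an addition): skip anything no longer present.
            continue
        start, end = match.span()
        text = text[:start] + placeholder + text[end:]
    return text


print(url_detector("sample url = https://www.google.com/search?q=sample&uact=5"))
```

Since the second lookup is now a plain literal search, text.replace(found, placeholder, 1) or a single re.sub() call with the same pattern would probably do the job as well. One thing to watch: the character class in the question's pattern also appears to accept '.' and ',', so punctuation that directly follows a URL may be swallowed into the placeholder.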