Tags: python, string, web-scraping, substring

How to substring with specific start and end positions where a set of characters appear?


I am trying to clean up links that I scraped. I have over 100 of them in a CSV.

This is what a link looks like in the CSV:

"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

Scraping this link for HTML data doesn't go well, so I need to extract the URL embedded in it. I want the substring that starts after &url= and ends at &ct, since that's where the real URL resides.

I've read posts like this, but couldn't find one that also handles an ending string. I've tried an approach from this using the substring package, but it doesn't work for markers longer than one character.

How do I do this, preferably without using third-party packages?


Solution

  • I'm not sure I fully understand the problem.

    If you have the link as a string, you can use string methods such as .find() together with slicing, [start:end].

    text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
    
    start = text.find('url=') + len('url=')
    end   = text.find('&ct=')
    
    text[start:end]
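
The same idea can be wrapped in a small helper that answers the general question of slicing between a start marker and an end marker. This is only a sketch: the name `substring_between` and the shortened example link are made up for illustration. It returns None when either marker is missing, because `.find()` returns -1 in that case and the slice would otherwise be silently wrong.

```python
from typing import Optional


def substring_between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present
    start += len(start_marker)

    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None  # end marker not present after the start marker
    return text[start:end]


# Shortened, made-up link with the same shape as the one in the question
link = "https://www.google.com/url?rct=j&sa=t&url=https://www.example.com/news/123&ct=ga&cd=abc"
print(substring_between(link, "&url=", "&ct="))
```

Both markers can be any length, so this also covers the "more than one character" case from the question.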
    

    But url= and ct= may appear in a different order, so it is safer to search for the first & after url=.

    text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
    
    start = text.find('url=') + len('url=')
    end   = text.find('&', start)
    if end == -1:  # url= is the last parameter, so take the rest of the string
        end = len(text)
    
    text[start:end]
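
As an alternative to index arithmetic, `str.partition()` can do the same split. The sketch below uses a hypothetical helper name, `value_after_key`, and a made-up link in which url= is the last parameter. `partition()` returns the part before the separator, the separator itself, and the part after it; if the separator is missing, the last two are empty strings.

```python
def value_after_key(text: str, key: str = "url=") -> str:
    """Return what follows key up to the next '&', or '' if key is absent."""
    _, found, rest = text.partition(key)
    if not found:
        return ""  # key does not occur in text
    # Everything up to the next '&'; the whole remainder if there is none
    value, _, _ = rest.partition("&")
    return value


# Made-up link where url= comes last, so no '&' follows it
link_url_last = "https://www.google.com/url?sa=t&ct=ga&url=https://www.example.com/a"
print(value_after_key(link_url_last))
```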
    

    EDIT:

    There is also the standard module urllib.parse for working with URLs, for example to split or join them.

    text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
    
    import urllib.parse
    
    url, query = urllib.parse.splitquery(text)
    data       = urllib.parse.parse_qs(query)
    
    data['url'][0]
    

    data now holds a dictionary:

    {'cd': ['SldisGkopisopiasenjA6Y28Ug'],
     'ct': ['ga'],
     'rct': ['j'],
     'sa': ['t'],
     'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
     'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
    

    EDIT:

    Python shows a warning that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead.

    text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
    
    import urllib.parse
    
    parts = urllib.parse.urlparse(text)
    data  = urllib.parse.parse_qs(parts.query)
    
    data['url'][0]
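
To apply this to the whole CSV from the question, something like the following might work. It is a sketch under assumptions: the CSV layout (one link per row, in the first column) is a guess, `extract_target` is a hypothetical helper name, and an in-memory `io.StringIO` with made-up links stands in for the real file. Links without a url parameter are returned unchanged.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def extract_target(link: str) -> str:
    """Return the value of the 'url' query parameter, or the link itself if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else link


# Stand-in for the real CSV file (assumed layout: one link per row)
sample_csv = io.StringIO(
    '"https://www.google.com/url?rct=j&sa=t&url=https://www.example.com/a/1&ct=ga"\n'
    '"https://www.example.com/already-clean"\n'
)

# Skip empty rows and take the first column of each remaining row
cleaned = [extract_target(row[0]) for row in csv.reader(sample_csv) if row]
print(cleaned)
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` object. Note that parse_qs() also percent-decodes the values, which plain slicing does not do.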