Search code examples
pythonurl

How parse urls created by a search bar query into url and search phrase?


I want to parse the url generated as the result of search querying into its components.

The following urls is generated as a result of searching on goolge search with the query phrase "url".

https://www.google.com/search?q=url&rlz=1C5GCEM_enIN1101IN1101&oq=url&gs_lcrp=EgZjaHJvbWUyFAgAEEUYORhDGIMBGLEDGIAEGIoFMg4IARBFGCcYOxiABBiKBTIMCAIQABhDGIAEGIoFMgYIAxBFGDwyBggEEEUYPDIGCAUQRRg8MgYIBhBFGDwyBggHEEUYPNIBCDU1NjhqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8

How can I categorize into url "www.google.com" plus "query":"url?

Are there packages in python does this parsing?

Input: " https://www.google.com/search?q=url&rlz=1C5GCEM_enIN1101IN1101&oq=url&gs_lcrp=EgZjaHJvbWUyFAgAEEUYORhDGIMBGLEDGIAEGIoFMg4IARBFGCcYOxiABBiKBTIMCAIQABhDGIAEGIoFMgYIAxBFGDwyBggEEEUYPDIGCAUQRRg8MgYIBhBFGDwyBggHEEUYPNIBCDU1NjhqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8 "

Output:{ url:"www.google.com", "query":"url"}


Solution

  • You can simply use the urllib.parse module.

    >>> from urllib.parse import urlparse, parse_qs
    >>> 
    >>> 
    >>> url = "https://www.google.com/search?q=url&rlz=1C5GCEM_enIN1101IN1101&oq=url&gs_lcrp=EgZjaHJvbWUyFAgAEEUYORhDGIMBGLEDGIAEGIoFMg4IARBFGCcYOxiABBiKBTIMCAIQABhDGIAEGIoFMgYIAxBFGDwyBggEEEUYPDIGCAUQRRg8MgYIBhBFGDwyBggHEEUYPNIBCDU1NjhqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8 "
    >>> 
    >>> r = urlparse(url)   
    >>> {"url": r.hostname, "query": parse_qs(r.query).get('q')}                    
    {'url': 'www.google.com', 'query': ['url']}