
Parsing a URL in Python with a changing part in it


I'm parsing a URL in Python; a sample URL and my code are below. I want to take the part number (74743) out of the URL and, in a for loop, replace it with each value from a list of parts. I tried urlparse but couldn't finish, mostly because of the changing parts in the URL. I just want the easiest and fastest way to do this.

Sample URL:

http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=

(http://example.com/wps/portal) Always fixed

(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing

(74743) Will be taken from a list named Parts

(IntNumberOf=&is=) Also changing depending on the section of the website

Here's the code:

from lxml import html
import requests
import urlparse


Parts = [74743, 85731, 93021]

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

parsing = urlparse.urlsplit(url)

print parsing

Solution

  • >>> import urlparse
    
    >>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
    
    >>> split_url = urlparse.urlsplit(url)
    >>> split_url.path
    '/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'
    

    You can split the path into a list of strings using '/', slice the list, and re-join:

    >>> path = split_url.path
    >>> path.split('/')
    ['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']
    

    Slice off the last two:

    >>> path.split('/')[:-2]
    ['', 'wps', 'portal']
    

    And re-join:

    >>> '/'.join(path.split('/')[:-2])
    '/wps/portal'
    

    To parse the query, use parse_qs:

    >>> urlparse.parse_qs(split_url.query)
    {'PartNo': ['74743']}
    

    To keep the empty parameters use keep_blank_values=True:

    >>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
    >>> query
    {'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
    

    You can then modify the query dictionary:

    >>> query['PartNo'] = 85731
    

    And update the original split_url (in Python 2, urlencode lives in the urllib module, so import it first):

    >>> import urllib
    >>> updated = split_url._replace(
    ...     path='/'.join(path.split('/')[:-2] + ['ASDFZXCVQWER', '']),
    ...     query=urllib.urlencode(query, doseq=True))
    
    >>> urlparse.urlunsplit(updated)
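    'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&IntNumberOf=&is='

    The order of the query parameters in the result may differ, because Python 2 dictionaries are unordered.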
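The question also asks for a loop over the Parts list, which the steps above stop short of. A minimal sketch of that loop follows, assuming Python 3, where urlparse and urllib.urlencode have moved into urllib.parse. The helper name build_part_url and its token parameter are hypothetical, introduced only for illustration. It uses parse_qsl rather than parse_qs so the parameters keep their original order, and it assumes the path ends with a trailing slash, as in the sample URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_part_url(url, part_no, token=None):
    """Return url with PartNo set to part_no and, optionally, a new last path segment.

    Hypothetical helper for illustration; not part of any library.
    """
    split_url = urlsplit(url)

    path = split_url.path
    if token is not None:
        # '/wps/portal/<old>/' splits to ['', 'wps', 'portal', '<old>', ''];
        # drop the last two items, then append the new segment and a trailing slash.
        path = '/'.join(path.split('/')[:-2] + [token, ''])

    # parse_qsl returns (name, value) pairs in their original order;
    # keep_blank_values=True keeps empty parameters such as 'is='.
    pairs = parse_qsl(split_url.query, keep_blank_values=True)
    pairs = [(name, str(part_no) if name == 'PartNo' else value)
             for name, value in pairs]

    return urlunsplit(split_url._replace(path=path, query=urlencode(pairs)))


url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
parts = [74743, 85731, 93021]

# One URL per part number; the rest of the URL is left untouched.
urls = [build_part_url(url, part_no) for part_no in parts]
for new_url in urls:
    print(new_url)
```

If the changing path segment also has to be swapped, pass it as token, for example build_part_url(url, 85731, token='ASDFZXCVQWER').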