Tags: python, web-scraping, beautifulsoup

Filter just the id number from a URL in BeautifulSoup


I've gotten to a point where

print(soup.td.a)

results in

<a href="/?p=section&amp;a=details&amp;id=37627">Some Text Here</a>

I'm trying to figure out how to filter this further so that the only output is

37627

I've tried a number of things, including urlparse and re.compile, but I can't get the syntax right. I also suspect there is an easier way that I'm just not finding. I appreciate any help.


Solution

  • You can split the href with urlparse() and then use the parse_qs() function to parse its query string into a dict:

    
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse, parse_qs
    
    html_content = '''
    <td>
        <a href="/?p=section&amp;a=details&amp;id=37627">Some Text Here</a>
    </td>
    '''
    
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Find the <a> tag
    a_tag = soup.find('a')
    
    # Extract the href attribute
    href = a_tag.get('href')
    
    # Parse the URL to get the query parameters
    parsed_url = urlparse(href)
    # Python 2: replace the urllib.parse import above with
    # `from urlparse import urlparse, parse_qs`; the rest is unchanged
    query_params = parse_qs(parsed_url.query)
    
    # Get the 'id' parameter
    id_value = query_params.get('id', [None])[0]
    
    print(id_value)  # Output: 37627
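
If you need this for more than one link, the same steps can be wrapped in a small helper. The following is only a minimal sketch, assuming every href has the same shape as the one in the question. The names `extract_id`, `extract_id_regex`, and `ID_PATTERN` are made up for illustration and are not part of BeautifulSoup or urllib. Since the question mentions re.compile, a regex variant is included too; it assumes the id is always all digits.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_id(href, param="id"):
    """Return the first value of `param` in the href's query string, or None."""
    values = parse_qs(urlparse(href).query).get(param)
    return values[0] if values else None


# Regex alternative: "id=<digits>" directly after "?" or "&",
# so a parameter such as "pid=..." is not matched by accident.
ID_PATTERN = re.compile(r"[?&]id=(\d+)")


def extract_id_regex(href):
    """Return the digits following 'id=' in the href, or None if absent."""
    match = ID_PATTERN.search(href)
    return match.group(1) if match else None


# BeautifulSoup already decodes "&amp;" to "&" in attribute values,
# so this is the string that a_tag.get('href') returns for the question's link.
href = "/?p=section&a=details&id=37627"

print(extract_id(href))
print(extract_id_regex(href))
```

Both functions return the id as a string, so wrap the result in int() if you need a number. The parse_qs version is probably the safer default, because it does not depend on the parameter order or on the id being numeric.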