
Parse href attribute value from element with BeautifulSoup and Mechanize


Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output, gather each value, and then insert it into a table named Tld with Python/Django.

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

I only want the value of the href attribute of the <a> element, i.e. only this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

The problem is that the find_all call above doesn't reach far enough to get to the <a> element.

Any help is much appreciated. Thank you.


Solution

    from bs4 import BeautifulSoup
    
    html = """
    <div class="rc" data-hveid="53">
    <h3 class="r">
    <a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
    </h3>
    """
    
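    # Name the parser explicitly so the result does not depend on which parsers are installed.
    bs = BeautifulSoup(html, "html.parser")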
    elms = bs.select("h3.r a")
    for i in elms:
        print(i.attrs["href"])
    

    prints:

    https://billing.anapp.com/
    

    h3.r a is a CSS selector.

    You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself does not support it), or find within elements. The selector h3.r a matches every h3 with class r and returns the a elements inside them. A more complicated selector such as #an_id table tr.the_tr_class td.the_td_class finds the td elements with class the_td_class that sit inside a tr with class the_tr_class, inside a table, inside the element with id an_id.
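To make the longer selector concrete, here is a minimal sketch. The markup is hypothetical and invented only to exercise the selector; the id and class names are the ones used in the paragraph above, not anything from the question's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration only.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match 1</td>
      <td class="other">skipped</td>
    </tr>
    <tr class="plain">
      <td class="the_td_class">skipped too</td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "somewhere inside", so every step
# narrows the search to descendants of the previous match.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
texts = [cell.get_text(strip=True) for cell in cells]
print(texts)
```

Only the td that has the right class and also sits in a tr with the right class is returned; the other two cells fail one of the two conditions.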

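    The following gives the same result. find_all returns a list of bs4.element.Tag objects, so you loop over each h3 and call find_all again to get the a elements inside it. find_all also has a recursive parameter, but I'm not sure whether that lets you do it in one line. I personally prefer CSS selectors because they are easy and clean.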

    for elm in bs.find_all('h3', attrs={'class': 'r'}):
        for a_elm in elm.find_all("a"):
            print(a_elm.attrs["href"])
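If you want either approach as a single expression, a list comprehension is one way to do it. The sketch below is only an illustration: the second h3 (with an a that has no href) is a hypothetical addition, included to show that filtering on the attribute avoids a KeyError.

```python
from bs4 import BeautifulSoup

# The first <h3> is shortened from the question; the second is hypothetical.
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
<h3 class="r"><a name="no-link">No href here</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector version: a[href] matches only <a> tags that carry an href.
hrefs_css = [a["href"] for a in soup.select("h3.r a[href]")]

# find_all version: href=True is the equivalent "attribute must exist" filter.
hrefs_find = [
    a["href"]
    for h3 in soup.find_all("h3", class_="r")
    for a in h3.find_all("a", href=True)
]

print(hrefs_css)
print(hrefs_find)
```

Both lists contain only https://billing.anapp.com/, and either one can be looped over to store the values wherever you need them.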