
How to extract tags from HTML using BeautifulSoup in Python


I am trying to parse through an HTML page which simplified looks like this:

<div class="anotherclass part">
  <a href="http://example.com" >
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div>
    <div class="column def"></div>
    <div class="column ghi">1 Feb 2013</div>
    <div class="column jkl">
      <h4>A title</h4>
      <p>
        <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
    </div>
  </a>
</div>

I am a beginner at coding in Python and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html.

I have got this code:

import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
  html = fp.read()

soup = BeautifulSoup(html)

# the "part" class is on the outer <div>, and the flag belongs inside re.compile
parts = soup.findAll('div', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
  mypart={}

  # ghi
  mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')} ).string
  # def
  mypart['def'] = part.find(attrs={"class": re.compile('def')} ).string
  # h4
  mypart['title'] = part.find('h4').string

  # jkl
  mypart['other'] = part.find('p').string

  # abc
  pattern = re.compile( r'\&\#163\;(\d{1,}\.?\d{2}?)' )
  theprices = re.findall( pattern, str(part) )
  if len(theprices) == 2:
    mypart['price'] = theprices[1]
    mypart['rrp'] = theprices[0]
  elif len(theprices) == 1:
    mypart['price'] = theprices[0]
    mypart['rrp'] = theprices[0]
  else:
    mypart['price'] = None
    mypart['rrp'] = None

I want to extract any text from the classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script does in a rather clunky fashion at the moment. Sometimes there are two prices, sometimes one, and sometimes none in this part.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace or remove it.

Any advice? Thank you!


Solution

  • First, if you pass convertEntities=bs.BeautifulSoup.HTML_ENTITIES to the BeautifulSoup constructor, as in

    soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
    

    then HTML entities such as &#163; will be converted to their corresponding Unicode characters, such as £. This will allow you to use a simpler regex to identify the prices.
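
    To see what entity conversion buys you without BeautifulSoup at all, here is a minimal sketch. It assumes Python 3, where the standard library provides html.unescape; that function is a stand-in for illustration, not what convertEntities does internally. The names raw, decoded and prices are made up for the example.

```python
import html
import re

# The contents of the "column abc" <div> from the question, entities intact.
raw = "<strike>&#163;3.99</strike><br>&#163;3.59"

# Decode numeric character references: &#163; becomes the pound sign.
decoded = html.unescape(raw)

# With a literal pound sign in the text, the price pattern no longer
# needs the escaped \&\#163\; prefix.
prices = re.findall(r"£(\d+\.\d{2})", decoded)

print(decoded)
print(prices)
```

    The regex here is only one possible simplification; it assumes prices always have exactly two decimal places.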


    Now, given part, you can find the text content in the <div> with the prices using its contents attribute:

    In [37]: part.find(attrs={"class": re.compile('abc')}).contents
    Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']
    

    All we need to do is extract the number from each item, or skip it if there is no number:

    def parse_price(text):
        try:
            return float(re.search(r'\d*\.\d+', text).group())
        except (TypeError, ValueError, AttributeError):
            return None
    
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)
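
    As a quick check of how parse_price behaves on the kinds of values that contents can hand it, the sketch below repeats the function and calls it on a few sample inputs. The sample inputs are assumptions chosen for illustration; note in particular that a price without a decimal point is not matched by this pattern.

```python
import re

def parse_price(text):
    """Return the first decimal number in text as a float, or None."""
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        # TypeError: text is None (e.g. the .string of a <br /> tag)
        # AttributeError: re.search found no match and returned None
        return None

with_price = parse_price(u'\xa33.59')   # pound sign followed by 3.59
no_text = parse_price(None)             # tag with no string
no_number = parse_price(u'')            # empty text
whole_pounds = parse_price(u'\xa34')    # no decimal point, so no match

print(with_price, no_text, no_number, whole_pounds)
```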
    

    At this point price will be a list of 0, 1, or 2 floats. We would like to say

    mypart['rrp'], mypart['price'] = price
    

    but that would not work if price is [] or contains only one item.

    Your method of handling the three cases with if..else is okay -- it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you'd like something a little more terse you could do the following:

    Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.

    In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).

    So to combine both cases into one expression, we could use

    price = itertools.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)
    

    The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then starting over again if necessary -- which is just what we want.
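
    The three cases can be checked in isolation with a small sketch. The helper name rrp_and_price is hypothetical and exists only to wrap the two lines above so they can be called repeatedly.

```python
import itertools

def rrp_and_price(prices):
    """Return (rrp, price) from a list of 0, 1, or 2 floats."""
    # An empty list is falsy, so "prices or [None]" substitutes [None];
    # cycling over an empty list would otherwise yield nothing at all.
    cycled = itertools.cycle(prices or [None])
    return next(cycled), next(cycled)

print(rrp_and_price([]))            # no prices found
print(rrp_and_price([3.59]))        # one price, used for both
print(rrp_and_price([3.99, 3.59]))  # rrp first, then price
```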


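    The text A, List, Of, Terms, To, Extract - 1 Feb 2013 can be obtained, again, through the contents attribute. part.find('p').string does not return the text here because the <p> tag has two children (the <img> tag and the text), and string is only available when a tag has a single child that is a string: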

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]
    

    So, the full runnable code would look like:

    import BeautifulSoup as bs
    import os
    import re
    import itertools as IT
    
    def parse_price(text):
        try:
            return float(re.search(r'\d*\.\d+', text).group())
        except (TypeError, ValueError, AttributeError):
            return None
    
    filename = os.path.expanduser("~/tmp/file.html")
    with open(filename) as fp:
        html = fp.read()
    
    soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
    
    for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
        mypart = {}
        # abc
        price = []
        for item in part.find(attrs={"class": re.compile('abc')}).contents:
            item = parse_price(item.string)
            if item:
                price.append(item)
    
        price = IT.cycle(price or [None])
        mypart['rrp'], mypart['price'] = next(price), next(price)
    
        # jkl
        mypart['other'] = [item for item in part.find('p').contents
                           if not isinstance(item, bs.Tag) and item.string.strip()]
    
        print(mypart)
    

    which yields

    {'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}
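
    The question also mentions removing the trailing date, which always matches the text in the ghi column. A minimal sketch of that step follows, assuming the date is always appended as " - " plus the exact ghi text; the helper name split_terms is hypothetical.

```python
def split_terms(text, date):
    """Strip a trailing ' - <date>' from text and split the rest on commas."""
    text = text.strip()
    suffix = " - " + date
    if text.endswith(suffix):
        # Drop the date only when it really is at the end of the text.
        text = text[:-len(suffix)]
    return [term.strip() for term in text.split(",") if term.strip()]

terms = split_terms(u'A, List, Of, Terms, To, Extract - 1 Feb 2013', u'1 Feb 2013')
print(terms)
```

    In the full script this could be applied to each item in mypart['other'], passing the string found in the ghi column as the date.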