python html parsing web-scraping beautifulsoup

How to extract tags from HTML using BeautifulSoup in Python

I am trying to parse through an HTML page which simplified looks like this:

<div class="anotherclass part">
  <a href="http://example.com" >
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div>
    <div class="column def"></div>
    <div class="column ghi">1 Feb 2013</div>
    <div class="column jkl">
      <h4>A title</h4>
      <p>
        <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
    </div>
  </a>
</div>

I am a beginner at Python and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

I have got this code:

import re
from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
  html = fp.read()

soup = BeautifulSoup(html)

parts = soup.findAll('div', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
  mypart={}

  # ghi
  mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')} ).string
  # def
  mypart['def'] = part.find(attrs={"class": re.compile('def')} ).string
  # h4
  mypart['title'] = part.find('h4').string

  # jkl
  mypart['other'] = part.find('p').string

  # abc
  pattern = re.compile( r'\&\#163\;(\d{1,}\.?\d{2}?)' )
  theprices = re.findall( pattern, str(part) )
  if len(theprices) == 2:
    mypart['price'] = theprices[1]
    mypart['rrp'] = theprices[0]
  elif len(theprices) == 1:
    mypart['price'] = theprices[0]
    mypart['rrp'] = theprices[0]
  else:
    mypart['price'] = None
    mypart['rrp'] = None

I want to extract any text from the classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script currently does in a rather clunky fashion. This part sometimes contains two prices, sometimes one, and sometimes none.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace or remove it.

Any advice? Thank you!

Solution

First, if you pass convertEntities=bs.BeautifulSoup.HTML_ENTITIES when constructing the soup,

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

then HTML entities such as &#163; will be converted to their corresponding Unicode characters, such as £. This lets you use a simpler regex to identify the prices.
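
To see what entity conversion does in isolation, here is a minimal sketch. It uses Python 3's standard-library html.unescape instead of BeautifulSoup's convertEntities option; that substitution is an assumption made so the snippet runs without BeautifulSoup installed. The sample strings are the price entities from the question's HTML.

```python
import html

# Numeric character references as they appear in the question's HTML.
raw_prices = ["&#163;3.99", "&#163;3.59"]

# html.unescape replaces each reference with the character it names
# (this is the standard library, not BeautifulSoup's own mechanism).
decoded = [html.unescape(p) for p in raw_prices]

print(decoded)
```

Once the pound sign is an ordinary character, the price regex no longer has to match the literal text "&#163;".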

Now, given part, you can find the text content in the <div> with the prices using its contents attribute:

In [37]: part.find(attrs={"class": re.compile('abc')}).contents
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']

All we need to do is extract the number from each item, or skip it if there is no number:

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item:
        price.append(item)
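
As a quick check of how parse_price treats the different kinds of children, here is a small sketch run on plain values. The samples list is hypothetical: it stands in for item.string on each child of the div (a string for text, None for a tag such as <br /> that has no string).

```python
import re

def parse_price(text):
    """Return the first decimal number found in text as a float, or None."""
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        # TypeError: text is None; AttributeError: no match was found.
        return None

# Hypothetical stand-ins for item.string on each child of the price div.
samples = [u'\xa33.99', None, u'\xa33.59', u'n/a']
results = [parse_price(s) for s in samples]
print(results)

# The pattern requires a decimal point, so a whole-number price is skipped.
whole = parse_price(u'\xa34')
print(whole)
```

Note that, with this pattern, a price written without a decimal point (for example "£4") would be treated as missing.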

At this point price will be a list of 0, 1, or 2 floats. We would like to say

mypart['rrp'], mypart['price'] = price

but that would not work if price is [] or contains only one item.

Your method of handling the three cases with if..else is okay -- it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you'd like something a little more terse you could do the following:

Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.

In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).

So to combine both cases into one expression, we could use

price = itertools.cycle(price or [None])
mypart['rrp'], mypart['price'] = next(price), next(price)

The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then starting over again if necessary -- which is just what we want.
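
The three cases can be checked with a short sketch. The helper name split_prices is hypothetical and only wraps the two lines above so they can be called repeatedly.

```python
import itertools

def split_prices(prices):
    """Return (rrp, price) from a list holding 0, 1, or 2 floats.

    split_prices is a hypothetical helper, not part of the original answer.
    """
    # An empty list is falsy, so it is replaced by [None] before cycling.
    cyc = itertools.cycle(prices or [None])
    return next(cyc), next(cyc)

two = split_prices([3.99, 3.59])   # both prices present
one = split_prices([3.59])         # single price is repeated
none = split_prices([])            # no price at all

print(two, one, none)
```

With two prices the first is taken as the rrp and the second as the price; with one price both keys get the same value; with none both are None.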

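The text A, List, Of, Terms, To, Extract - 1 Feb 2013 can likewise be obtained through the contents attribute. part.find('p').string returns None here because the <p> tag has more than one child (the <img> tag and the text), and .string is only set when a tag's single child is a string: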

# jkl
mypart['other'] = [item for item in part.find('p').contents
                   if not isinstance(item, bs.Tag) and item.string.strip()]

So, the full runnable code would look like:

import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)

    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]

    print(mypart)

which yields

{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}
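
The question also asks for the date to be removed from the jkl text. One possible way to do that, sketched with plain string operations: strip_date is a hypothetical helper, and the two input strings are copied from the output above and from the ghi div in the question, on the assumption that the date always appears as a trailing " - <date>".

```python
def strip_date(text, date):
    """Remove a trailing ' - <date>' from text, if present.

    strip_date is a hypothetical helper, not part of the original answer.
    """
    suffix = u' - ' + date
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    return text.strip()

other = u'A, List, Of, Terms, To, Extract - 1 Feb 2013'  # mypart['other'][0]
ghi = u'1 Feb 2013'                                      # text of the ghi div

cleaned = strip_date(other, ghi)
# Optionally split the remaining text into individual terms.
terms = [term.strip() for term in cleaned.split(',')]

print(cleaned)
print(terms)
```

If the date does not match the end of the text, the text is returned unchanged apart from surrounding whitespace, so a mismatch does not silently drop content.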