I am trying to parse an HTML page which, simplified, looks like this:
<div class="anotherclass part">
<a href="http://example.com" >
<div class="column abc"><strike>£3.99</strike><br>£3.59</div>
<div class="column def"></div>
<div class="column ghi">1 Feb 2013</div>
<div class="column jkl">
<h4>A title</h4>
<p>
<img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
</div>
</a>
</div>
I am a beginner at coding in Python and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
I have got this code:
import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
    html = fp.read()

soup = BeautifulSoup(html)
parts = soup.findAll('a', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
    mypart = {}
    # ghi
    mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')}).string
    # def
    mypart['def'] = part.find(attrs={"class": re.compile('def')}).string
    # h4
    mypart['title'] = part.find('h4').string
    # jkl
    mypart['other'] = part.find('p').string
    # abc
    pattern = re.compile(r'\&\#163\;(\d{1,}\.?\d{2}?)')
    theprices = re.findall(pattern, str(part))
    if len(theprices) == 2:
        mypart['price'] = theprices[1]
        mypart['rrp'] = theprices[0]
    elif len(theprices) == 1:
        mypart['price'] = theprices[0]
        mypart['rrp'] = theprices[0]
    else:
        mypart['price'] = None
        mypart['rrp'] = None
I want to extract any text from the classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script does in a rather clunky fashion at the moment. Sometimes there are two prices, sometimes one, and sometimes none in this part.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace/remove it.

Any advice? Thank you!
First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES:

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

then HTML entities such as &#163; will be converted to their corresponding unicode characters, such as £. This will allow you to use a simpler regex to identify the prices.
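To see why decoding helps, here is a small sketch using only the stdlib, with a made-up fragment standing in for the page source:

```python
import re

# Before decoding, prices appear as HTML entities and need an awkward pattern:
raw = "<strike>&#163;3.99</strike><br>&#163;3.59"
entity_prices = re.findall(r'&#163;(\d+\.\d{2})', raw)

# After decoding, the text contains the actual pound sign,
# so a plain numeric pattern suffices:
decoded = raw.replace('&#163;', u'\xa3')
plain_prices = re.findall(r'\d+\.\d{2}', decoded)

print(entity_prices)  # ['3.99', '3.59']
print(plain_prices)   # ['3.99', '3.59']
```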
Now, given part, you can find the text content in the <div> with the prices using its contents attribute:
In [37]: part.find(attrs={"class": re.compile('abc')}).contents
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']
All we need to do is extract the number from each item, or skip it if there is no number:

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item is not None:
        price.append(item)
At this point price will be a list of 0, 1, or 2 floats.
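For instance, calling parse_price on the kinds of values that show up in contents (the inputs here are illustrative; the `<br />` tag has no string, so its value is None):

```python
import re

def parse_price(text):
    try:
        # Pull the first decimal number out of the text, if any.
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        # text was None, or contained no number.
        return None

print(parse_price(u'\xa33.99'))  # 3.99  (the <strike> text)
print(parse_price(u'\xa33.59'))  # 3.59  (the text after <br>)
print(parse_price(None))         # None  (<br /> has no string)
```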
We would like to say

mypart['rrp'], mypart['price'] = price

but that would not work if price is [] or contains only one item.

Your method of handling the three cases with if..else is okay -- it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you'd like something a little more terse, you could do the following:

Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.

In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).

So to combine both cases into one expression, we could use

price = itertools.cycle(price or [None])
mypart['rrp'], mypart['price'] = next(price), next(price)

The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then starting over again if necessary -- which is just what we want.
The A, List, Of, Terms, To, Extract - 1 Feb 2013 could be obtained again through the use of the contents attribute:

# jkl
mypart['other'] = [item for item in part.find('p').contents
                   if not isinstance(item, bs.Tag) and item.string.strip()]
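Since you mention the trailing date always matches the ghi text, one way to remove it is to strip a matching " - <date>" suffix. A sketch on plain strings (the helper name is illustrative):

```python
def strip_date(text, date):
    # Remove a trailing " - <date>" suffix when it matches the ghi date.
    suffix = ' - ' + date
    if text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(strip_date('A, List, Of, Terms, To, Extract - 1 Feb 2013', '1 Feb 2013'))
# prints: A, List, Of, Terms, To, Extract
```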
So, the full runnable code would look like:

import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item is not None:
            price.append(item)
    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)
    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]
    print(mypart)
which yields
{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}