I am trying to use BeautifulSoup 4 to extract text from specific tags in an HTML document. My HTML has a bunch of div tags like the following:
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
<span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
Futures Daily Market Report for Financial Gas
<br/>
21-Jul-2015
<br/>
</span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
<span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
COMMODITY
<br/>
</span>
</div>
I am trying to get the text from all span tags that are inside any div tag whose style attribute contains "left:54px".
I can get a single div if I use:
soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',attrs={"style":"position:absolute; border: textbox 1px solid; "
"writing-mode:lr-tb; left:42px; top:90px; "
"width:195px; height:24px;"})
It returns:
[<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;"><span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">Futures Daily Market Report for Financial Gas
<br/>21-Jul-2015
<br/></span></div>]
But that only gets me the one div whose style matches that string exactly. I want all divs whose style includes "left:54px", regardless of the other properties.
To do this, I've tried a few different ways:
soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',style='left:54px')
print soup.find_all('div',attrs={"style":"left:54px"})
print soup.find_all('div',attrs={"left":"54px"})
But all these print statements return empty lists.
Any ideas?
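The string attempts return nothing because a string filter has to equal the entire style value, and attrs={"left":"54px"} looks for an attribute named left, which these divs do not have. A compiled regular expression is applied with search(), so it matches when the text appears anywhere in the value. You can pass one in instead of a string, according to the documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments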
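To illustrate the difference between a string filter and a regular expression filter, here is a rough sketch in plain Python. It does not call Beautiful Soup at all; it only models the two comparisons (whole-value equality versus re.search), and the variable names are made up for the example.

```python
import re

# Style value copied from the second div in the question.
style = ("position:absolute; border: textbox 1px solid; writing-mode:lr-tb; "
         "left:54px; top:135px; width:46px; height:10px;")

# A string filter has to equal the whole attribute value, so a fragment fails.
exact_match = (style == "left:54px")

# A regex filter is applied with search(), which scans the whole value.
pattern = re.compile(r"left:54px")
regex_match = pattern.search(style) is not None

print(exact_match, regex_match)
```

The first comparison fails and the second succeeds, which mirrors the empty lists from the string-based attempts.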
So I would try this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(extracted_html_file))
# The compiled pattern matches any div whose style contains left:54px
soup.find_all('div', style=re.compile('left:54px'))
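To go one step further and pull the span text the question asks for, something like the following might work. It is a minimal sketch, not a tested solution for the real file: the markup is a trimmed copy of the question's sample held in a string (an assumption, since the original reads from extracted_html_file), the parser is explicitly set to "html.parser", and the names html, left_54 and texts are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Trimmed sample markup based on the question; the real input comes from a file.
html = """
<div style="position:absolute; left:42px; top:90px;">
<span style="font-size:12px">Futures Daily Market Report for Financial Gas<br/>21-Jul-2015<br/></span>
</div>
<div style="position:absolute; left:54px; top:135px;">
<span style="font-size:10px">COMMODITY<br/></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The compiled regex is matched with search(), so it can hit anywhere in the style value.
left_54 = re.compile(r"left:54px")

# Collect the stripped text of every span inside each matching div.
texts = []
for div in soup.find_all("div", style=left_54):
    for span in div.find_all("span"):
        texts.append(span.get_text(strip=True))

print(texts)
```

For this sample, only the COMMODITY span ends up in texts. Note that the pattern left:54px would also match a property such as margin-left:54px, so a stricter pattern may be needed if the real document contains those.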