python html beautifulsoup inline-styles

Locating tags via styles - using Python 2 and BeautifulSoup 4


I am trying to use BeautifulSoup 4 to extract text from specific tags in an HTML document. The HTML contains many div tags like the following:

<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
    Futures Daily Market Report for Financial Gas
    <br/>
    21-Jul-2015
    <br/>
   </span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
    COMMODITY
    <br/>
   </span>
</div>

I am trying to get the text from all span tags that are in any div tag that has a style of "left:54px".

I can get a single div if I use:

soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',attrs={"style":"position:absolute; border: textbox 1px solid; "
                                         "writing-mode:lr-tb; left:42px; top:90px; "
                                         "width:195px; height:24px;"})

It returns:

[<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;"><span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">Futures Daily Market Report for Financial Gas
<br/>21-Jul-2015
<br/></span></div>]

But that only gets me the one div whose style matches that string exactly. I want every div whose style contains "left:54px", whatever else the style includes.

To do this, I've tried a few different ways:

soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',style='left:54px')
print soup.find_all('div',attrs={"style":"left:54px"})
print soup.find_all('div',attrs={"left":"54px"})

But all these print statements return empty lists.

Any ideas?


Solution

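  • A plain string is compared against the whole attribute value, which is why 'left:54px' on its own finds nothing. You can pass in a compiled regular expression instead of a string, according to the documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments. The pattern is searched for anywhere in the attribute value, so a fragment of the style is enough.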

    So I would try this:

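    import re

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open(extracted_html_file))
    # The regex is searched within each div's style value, so a substring is enough.
    divs = soup.find_all('div', style=re.compile('left:54px'))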
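
To get from the matching divs to the span text that the question asks for, something like the following minimal sketch may help. It assumes the sample markup from the question is held in a string (the names `html`, `left_54` and `span_texts` are made up for this example) and that the built-in "html.parser" is acceptable. It also tightens the pattern a little so that "margin-left:54px" is not matched and "left: 54px" is; that refinement is an assumption about the input, not something the answer above requires. The single-argument print() call works under both Python 2 and Python 3.

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question (hypothetical stand-in for the real file).
html = """
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
    Futures Daily Market Report for Financial Gas
    <br/>
    21-Jul-2015
    <br/>
  </span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
    COMMODITY
    <br/>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# (?<![\w-]) rejects properties such as "margin-left"; \s* tolerates "left: 54px".
left_54 = re.compile(r"(?<![\w-])left:\s*54px")

span_texts = []
for div in soup.find_all("div", style=left_54):
    for span in div.find_all("span"):
        # Join the span's text fragments with spaces and drop surrounding whitespace.
        span_texts.append(span.get_text(" ", strip=True))

print(span_texts)
```

With the two sample divs, only the second one (the "COMMODITY" block) has a style containing left:54px, so `span_texts` ends up with a single entry.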