
Extracting part of a 'style' attribute value into a Python variable with bs4


Let's assume we have the following HTML:

 <tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something4" something_id="6something7">
 <td class="text-center td_something">
 <div>
 <span doo="true" class="foo" style="left:70%;z-index:99;">
 <span doo="true" class="foo" style="left:50%;z-index:90;">
 <span doo="true" class="Kung foo" style="left:90%;z-index:95;">
 </div>
 </td>
 </tr>
 <tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something4" something_id="6something7">
 <td class="text-center td_something">
 <div>
 <span doo="true" class="Kung foo" style="left:35%;z-index:95;">
 </div>
 </td>
 </tr>
 <tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something4" something_id="6something7">
 <td class="text-center td_something">
 <div>
 <span doo="true" class="foo" style="left:99%;z-index:100;">
 </div>
 </td>
 </tr>

How can I use bs4 in Python to build a list holding, for each tr, the highest value of 'left' in the 'style' attribute of its spans, ignoring any span whose class includes "Kung"?

Desired result would be:

[70,False or NaN,99]

I think I should start with something like:

import re
trs = soup.find_all('tr', attrs={"data-something": "1"})
List = list()
for tr in trs:
    spans = tr.find_all('span', {'style': re.compile(r'^left:.')})

Solution

    >>> import bs4
    >>> HTML = open('temp.htm').read()
    >>> soup = bs4.BeautifulSoup(HTML, 'lxml')
    

    First, select all of the elements whose class contains foo (whether or not it contains anything else as well).

    >>> elements = soup.select('.foo')
    

    In each case element['class'] is a list of the class names on the element, i.e., either just foo, or Kung and foo for this HTML. Since every selected element has foo, checking that element['class'] has length 1 is a test for foo alone.

    element['style'] gets the contents of style for the element. Use a regex for the part of it we want, and add it to the list called lefts.

    >>> import re
    >>> lefts = []
    >>> for element in elements:
    ...     if len(element['class']) == 1:  # class is exactly ['foo']
    ...         lefts.append(int(re.search(r'left:([0-9]+)', element['style']).group(1)))
    ... 
    >>> 
    >>> lefts
    [70, 50, 99]
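
    The regex step can also be pulled out into a small helper so that a style without a left value does not raise an AttributeError. This is only a sketch: the name left_value and the stricter pattern (which refuses to match margin-left or padding-left) are assumptions for illustration, not part of the original answer.

```python
import re

# Match "left:" only at the start of the style or right after a ";",
# so that "margin-left:" and similar properties are not picked up.
_LEFT = re.compile(r'(?:^|;)\s*left:\s*(\d+)')

def left_value(style):
    """Return the integer after 'left:' in a style string, or None if absent."""
    match = _LEFT.search(style or '')
    return int(match.group(1)) if match else None

print(left_value('left:70%;z-index:99;'))  # 70
print(left_value('z-index:99;'))           # None
```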
    

    Edit:

    Find the tr elements, then within each one look for the elements with class foo. As before, consider only those elements whose class is just foo, not both foo and Kung. Gather the left values from the style of these elements, then take the maximum for each tr (or None when a tr has no qualifying element).

    >>> import re
    >>> import bs4
    >>> HTML = open('temp.htm').read()
    >>> soup = bs4.BeautifulSoup(HTML, 'lxml')
    >>> trs = soup.find_all('tr')
    >>> tr_max = []
    >>> for tr in trs:
    ...     elements = tr.select('.foo')
    ...     lefts = []
    ...     for element in elements:
    ...         if len(element['class']) == 1:  # skip "Kung foo" spans
    ...             lefts.append(int(re.search(r'left:([0-9]+)', element['style']).group(1)))
    ...     if lefts:
    ...         tr_max.append(max(lefts))
    ...     else:
    ...         tr_max.append(None)
    ... 
    >>> tr_max
    [70, None, 99]
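
    The length check works for this HTML, but it would also drop a span carrying some other harmless second class. A variant that excludes Kung by name might look like the sketch below. The function name max_left_per_row, the inline sample HTML (with the spans closed), and the choice of the built-in 'html.parser' are assumptions made so the example is self-contained; they are not taken from the original answer.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question, with closing </span> tags added.
HTML = """
<table>
<tr data-something="1"><td><div>
  <span class="foo" style="left:70%;z-index:99;"></span>
  <span class="foo" style="left:50%;z-index:90;"></span>
  <span class="Kung foo" style="left:90%;z-index:95;"></span>
</div></td></tr>
<tr data-something="1"><td><div>
  <span class="Kung foo" style="left:35%;z-index:95;"></span>
</div></td></tr>
<tr data-something="1"><td><div>
  <span class="foo" style="left:99%;z-index:100;"></span>
</div></td></tr>
</table>
"""

LEFT_RE = re.compile(r'(?:^|;)\s*left:\s*(\d+)')

def max_left_per_row(html):
    """Return the largest 'left' value per tr, or None if a row has none."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for tr in soup.find_all('tr', attrs={'data-something': '1'}):
        lefts = []
        # Only spans whose style contains a left value are returned here.
        for span in tr.find_all('span', style=LEFT_RE):
            if 'Kung' in span.get('class', []):
                continue  # excluded by name, whatever other classes it has
            lefts.append(int(LEFT_RE.search(span['style']).group(1)))
        result.append(max(lefts) if lefts else None)
    return result

print(max_left_per_row(HTML))  # [70, None, 99]
```

    Returning None for an empty row keeps the list aligned with the tr elements, which matches the "False or NaN" placeholder asked for in the question.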