python, python-3.x, python-2.7, beautifulsoup, python-beautifultable

Beautifulsoup - Extract text from next div sub tag based on previous div sub tag


I'm trying to extract the text in the span of the next div, based on the span text of the previous div. Below is the HTML content:

<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven
    <br></span></div>

I'm trying to find the text using:

n_field = soup.find('span', text="Name\n")

And then I'm trying to get the text from the next sibling using:

n_field.next_sibling()

However, due to the "\n" in the field, I'm unable to find the span and then extract the next_sibling text.

In short, I'm trying to build a dict in the format below:

{"Name": "Ven"}

Any help or idea on this is appreciated.


Solution

  • You could use re instead of bs4.

    import re
    
    html = """
        <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;">
            <span style="font-family: b'Times-Bold'; font-size:13px">Name
                <br>
            </span>
        </div>
        <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;">
            <span style="font-family: b'Helvetica'; font-size:13px">Ven
                <br>
            </span>
        """
    
    mo = re.search(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)
    print(mo.groups())
    
    # For repeated occurrences, use re.finditer or re.findall.
    html *= 5  # repeat the snippet to simulate several occurrences
    matches = re.finditer(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)
    
    for match in matches:
        print(match.groups())
    
    for (key, value) in re.findall(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL):
        print(key, value)
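
  • If you would rather stay with bs4, here is a minimal sketch of one possible approach. It assumes the label and the value always sit in adjacent sibling divs, as in the question's HTML. The helper name value_after is made up for illustration. Matching on the stripped text sidesteps the trailing "\n", and the lookup moves up to the parent div first. The span itself appears to have no sibling to move to, because the value lives in the next div. The original find call probably returned nothing because a text filter on a tag is compared against the tag's .string, which is None when the tag has more than one child (here the text plus the <br>).

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
html = """<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven
    <br></span></div>"""

soup = BeautifulSoup(html, "html.parser")


def value_after(soup, label):
    """Return the text of the div that follows the div whose span reads `label`.

    Hypothetical helper: assumes label and value are in adjacent sibling divs.
    """
    for span in soup.find_all("span"):
        # get_text(strip=True) drops the trailing newline and whitespace.
        if span.get_text(strip=True) == label:
            parent_div = span.find_parent("div")
            if parent_div is None:
                continue
            next_div = parent_div.find_next_sibling("div")
            if next_div is not None:
                return next_div.get_text(strip=True)
    return None


result = {"Name": value_after(soup, "Name")}
print(result)
```

    For several label/value pairs, the same helper could be called once per label to fill the dict. A label that is not present yields None rather than raising.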