Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)


I have the following HTML inside a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between <br /> tags. I can isolate and navigate to each of the <br /> elements, but can't find a way to get the text in between. Any help would be greatly appreciated. Thanks.


Solution

  • If you just want any text which is between two <br /> tags, you could do something like the following:

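    # Requires the beautifulsoup4 package (imported as bs4)
    from bs4 import BeautifulSoup, NavigableString, Tag
    
    # Sample markup from the question
    html = '''<br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br />'''
    
    soup = BeautifulSoup(html, 'html.parser')
    
    for br in soup.find_all('br'):
        # The node right after this <br /> must be a text node...
        next_s = br.next_sibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        # ...and that text node must itself be followed by another <br />
        next2_s = next_s.next_sibling
        if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
            # Skip whitespace-only strings, such as the newline between two adjacent <br /> tags
            text = str(next_s).strip()
            if text:
                print("Found:", text)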
    

    But perhaps I've misunderstood your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I've gone with the description ;)
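
If the labels in the example data are what actually matters, one possible reading (an assumption on my part, the question doesn't state it) is that a line is "not important" when it comes straight after two consecutive <br /> tags. Under that assumption, a minimal sketch could look like the following. The names `is_br`, `important_lines` and `sample` are made up for illustration, and the code assumes the bs4 package with the built-in 'html.parser' backend:

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def is_br(node):
    """Return True if node is a <br> tag."""
    return isinstance(node, Tag) and node.name == "br"


def important_lines(html):
    """Collect text that sits between two <br> tags, skipping any text whose
    opening <br> is itself directly preceded by another <br>.

    ASSUMPTION: "double <br> before the text means not important" is only
    inferred from the sample data in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        text_node = br.next_sibling
        if not isinstance(text_node, NavigableString):
            continue
        text = str(text_node).strip()
        # Need non-empty text that is closed by another <br>
        if not text or not is_br(text_node.next_sibling):
            continue
        # Walk backwards over whitespace-only strings to find what precedes this <br>
        prev = br.previous_sibling
        while isinstance(prev, NavigableString) and not str(prev).strip():
            prev = prev.previous_sibling
        if is_br(prev):
            continue  # preceded by <br /><br />: assumed "not important"
        found.append(text)
    return found


sample = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""

print(important_lines(sample))
```

For the shortened sample above this should print only the two "Important Text" lines. If your real document uses a different marker to tell the two kinds of line apart, the backwards check on `prev` is the part to change.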