python, web-scraping, beautifulsoup

Use BeautifulSoup to extract text under specific header


How do I extract all the text below a specific header? In this case, I need to extract the text under Topic 2. EDIT: On other webpages, "Topic 2" sometimes appears as the third heading, or the first. "Topic 2" isn't always in the same place, and it doesn't always have the same id number.

# import library
from bs4 import BeautifulSoup

# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>

<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>

<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''

# convert text to soup 
soup = BeautifulSoup(body, 'lxml')

If I extract only the text under "Topic 2", this is the output I expect:

This is the fourth sentence. This is the fifth sentence.

My attempts to solve this problem:

I tried soup.select('h2 + p'), but this only returned the first paragraph under each header:

[<p> This is the first sentence.</p>,
 <p> This is the fourth sentence.</p>,
 <p> This is the sixth sentence.</p>]

I also tried this, but it gave me all the text, when I only need text under Topic 2:

import pandas as pd 

lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)

df = pd.DataFrame(lst) 

df

|   | text                          |
|---|-------------------------------|
| 0 | This is the first sentence.   |
| 1 | This is the second sentence.  |
| 2 | This is the third sentence.   |
| 3 | This is the fourth sentence.  |
| 4 | This is the fifth sentence.   |
| 5 | This is the sixth sentence.   |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence.  |

Solution

  • Try:

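    # Locate the heading by its text, wherever it appears on the page
    target = soup.find('h2', string='Topic 2')
    if target is not None:
        # Walk the tags that follow the heading and stop at the next h2
        for sib in target.find_next_siblings():
            if sib.name == 'h2':
                break
            print(sib.text)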
    

    Output (from your HTML above):

     This is the fourth sentence.
     This is the fifth sentence.
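
If you need the text as a single string rather than printed line by line, the same sibling walk can be wrapped in a small helper. This is only a sketch: the function name text_under_heading and its heading_tag parameter are made up for illustration, it uses the built-in 'html.parser' instead of 'lxml' so it runs without extra installs, and it assumes the heading and its paragraphs are siblings, as in the sample HTML.

```python
from bs4 import BeautifulSoup


def text_under_heading(soup, heading_text, heading_tag="h2"):
    """Collect the text of each tag between a heading and the next heading.

    Hypothetical helper: returns a list of strings, or an empty list
    when no heading with the given text exists.
    """
    # Compare on stripped text so stray whitespace inside the heading
    # does not prevent a match.
    target = soup.find(
        lambda tag: tag.name == heading_tag
        and tag.get_text(strip=True) == heading_text
    )
    if target is None:
        return []

    parts = []
    for sib in target.find_next_siblings():
        # The next heading of the same level marks the end of the section.
        if sib.name == heading_tag:
            break
        parts.append(sib.get_text(strip=True))
    return parts


body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>

<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>

<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''

soup = BeautifulSoup(body, "html.parser")
print(" ".join(text_under_heading(soup, "Topic 2")))
# This is the fourth sentence. This is the fifth sentence.
```

Because the lookup is by heading text, it does not matter where "Topic 2" sits on the page or which id it carries. Returning an empty list for a missing heading lets you run it over many pages without an AttributeError on the ones that lack the section.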