How do I extract all the text below a specific header? In this case, I need to extract the text under Topic 2.
EDIT: On other webpages, "Topic 2" sometimes appears as the third heading, or the first. "Topic 2" isn't always in the same place, and it doesn't always have the same id number.
# import library
from bs4 import BeautifulSoup
# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''
# convert text to soup
soup = BeautifulSoup(body, 'lxml')
If I extract only the text under "Topic 2", this is what my output should be:
This is the fourth sentence. This is the fifth sentence.
My attempts to solve this problem:
I tried soup.select('h2 + p'), but this only got me the first sentence under each header.
[<p> This is the first sentence.</p>,
<p> This is the fourth sentence.</p>,
<p> This is the sixth sentence.</p>]
I also tried this, but it gave me all the text, when I only need the text under Topic 2:
import pandas as pd
lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)
df = pd.DataFrame(lst)
df
| | text |
|---|-------------------------------|
| 0 | This is the first sentence. |
| 1 | This is the second sentence. |
| 2 | This is the third sentence. |
| 3 | This is the fourth sentence. |
| 4 | This is the fifth sentence. |
| 5 | This is the sixth sentence. |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence. |
Try:
target = soup.find('h2', string='Topic 2')
for sib in target.find_next_siblings():
    if sib.name == "h2":
        break
    else:
        print(sib.text)
Output (from your HTML above):
This is the fourth sentence.
This is the fifth sentence.
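If you need this on several pages, the same loop could be wrapped in a small helper. This is only a sketch: the function name `text_under_heading` and its `tag` parameter are made up for illustration, the sample HTML is shortened, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies. It assumes the heading text matches exactly and that the paragraphs are siblings of the heading, as in your sample.

```python
from bs4 import BeautifulSoup

# shortened version of the dummy webpage text from the question
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''


def text_under_heading(soup, heading_text, tag="h2"):
    """Hypothetical helper: collect the text of every sibling tag that
    follows the matching heading, stopping at the next heading of the
    same level."""
    # locate the heading by its text, not by its position or id
    target = soup.find(tag, string=heading_text)
    if target is None:
        # heading not present on this page
        return []
    texts = []
    for sib in target.find_next_siblings():
        if sib.name == tag:
            break  # reached the next section
        texts.append(sib.get_text(strip=True))
    return texts


soup = BeautifulSoup(body, "html.parser")
result = text_under_heading(soup, "Topic 2")
print(" ".join(result))
# This is the fourth sentence. This is the fifth sentence.
```

Returning an empty list when the heading is missing avoids the AttributeError you would otherwise get from calling find_next_siblings() on None.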