I have an HTML page from which I am extracting all the headers (h1 to h7) using Beautiful Soup. Now I want to build a list in which each entry is the current header's text prefixed with the text of all of its immediate higher-level headers.
For example, I have this sample HTML page:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h1>dummy h1</h1>
<h1>head 1</h1>
<p>para 1</p>
<h2>head 2</h2>
<p>para 2</p>
<h3>head 3</h3>
<p>p for head3</p>
<h2>head2(2)</h2>
<p>para3</p>
<h1>head1(2)</h1>
<h2>2nd h2</h2>
<h3>2nd h3</h3>
<p>2nd p for h3</p>
</body>
</html>
For this page, the list I want should look like:
['head1','head1 head2','head1 head2 head3','head1 head2(2)','head1(2)','head1(2) 2nd h2','head1(2) 2nd h2 2nd h3']
The logic I am using is to break out of the loop as soon as I encounter a smaller h tag while traversing backwards from the current h tag. This creates a problem because the loop breaks at head3 while traversing back from head2(2), where it should ideally go up to head1. Here is the code I tried:
from copy import deepcopy

from bs4 import BeautifulSoup

file = open("sample.html","r")
page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy
head=[]
h=[]
h3=[]
for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):
                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]
                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue
                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)):
                    break
            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)
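To show where the backward walk stops, here is a small sketch that replays only the break test on plain (tag name, text) pairs. It assumes the two `if j < len(h)-1 ...` checks run for every `q`, not only for higher-level ones. `first_break` and `reversed_seen` are names made up for this illustration.

```python
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7']

# Headers seen so far when the loop reaches head2(2), newest first
# (this mirrors the reversed list h3 in the code above).
reversed_seen = [("h2", "head2(2)"), ("h3", "head 3"),
                 ("h2", "head 2"), ("h1", "head 1")]

def first_break(seen):
    """Return (index, current text, next text) where the break test first fires."""
    for j, (name, text) in enumerate(seen):
        # Same comparison as the break condition: the current tag is
        # higher-level than the next (older) tag in the reversed list.
        if j < len(seen) - 1 and tags.index(name) < tags.index(seen[j + 1][0]):
            return j, text, seen[j + 1][1]
    return None

print(first_break(reversed_seen))  # (0, 'head2(2)', 'head 3')
```

Under that assumption the test already fires when head2(2) is compared with head 3, so head 1 is never reached.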
Please suggest what I can do to avoid this problem.
I found out what was wrong with your algorithm. You just need to add a test on the value of q.name to your break condition:
if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
    break
So the full code will be:
from copy import deepcopy

from bs4 import BeautifulSoup

file = open("sample.html","r")
page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy
head=[]
h=[]
h3=[]
for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):
                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]
                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue
                # stop walking back only once an h1 has been reached
                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
                    break
            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)
OUTPUT:
['head 1 ', 'head 1 head 2 ', 'head 1 head 2 head 3 ', 'head 1 head2(2) ', 'head1(2) ', 'head1(2) 2nd h2 ', 'head1(2) 2nd h2 2nd h3 ']
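One caveat, as far as I can tell: the extra `q.name == 'h1'` test fits this sample, but two h2 headers in a row followed by an h3 would still get both h2 texts prepended. If that matters, a different technique (not the code above) is to keep a stack of the currently open headers. Below is a minimal sketch. `heading_paths` and the (tag name, text) pairs are names I made up for illustration, and the commented-out Beautiful Soup lines are only one possible way to build the pairs.

```python
def heading_paths(headings):
    """headings: (tag_name, text) pairs in document order, e.g. ("h2", "head 2")."""
    stack = []   # (level, text) of the headers that are still "open"
    paths = []
    for name, text in headings:
        level = int(name[1:])  # "h3" -> 3
        # Close every header at the same or a deeper level than the current one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
        paths.append(" ".join(t for _, t in stack))
    return paths

# Assumption: with Beautiful Soup the pairs could be built roughly like this,
# skipping the dummy first h1:
#   found = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])[1:]
#   pairs = [(t.name, t.get_text(strip=True)) for t in found]
sample = [("h1", "head 1"), ("h2", "head 2"), ("h3", "head 3"),
          ("h2", "head2(2)"), ("h1", "head1(2)"),
          ("h2", "2nd h2"), ("h3", "2nd h3")]
print(heading_paths(sample))
# ['head 1', 'head 1 head 2', 'head 1 head 2 head 3', 'head 1 head2(2)',
#  'head1(2)', 'head1(2) 2nd h2', 'head1(2) 2nd h2 2nd h3']
```

Unlike the output above, this version leaves no trailing space on each entry. Note that `find_all` also returns headers nested inside other elements, which `next_siblings` does not.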
Let me know if it helps :-)