I'm trying to create a nested table of contents based on the heading tags of an HTML file.
My HTML file:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<h1>
My report Name
</h1>
<h1 id="2">First Chapter </h1>
<h2 id="3"> First Sub-chapter of the first chapter</h2>
<ul>
<h1 id="text1">Useless h1</h1>
<p>
some text
</p>
</ul>
<h2 id="4">Second Sub-chapter of the first chapter </h2>
<ul>
<h1 id="text2">Useless h1</h1>
<p>
some text
</p>
</ul>
<h1 id="5">Second Chapter </h1>
<h2 id="6">First Sub-chapter of the Second chapter </h2>
<ul>
<h1 id="text6">Useless h1</h1>
<p>
some text
</p>
</ul>
<h2 id="7">Second Sub-chapter of the Second chapter </h2>
<ul>
<h1 id="text6">Useless h1</h1>
<p>
some text
</p>
</ul>
</body>
</html>
My Python code:
from lxml import html
from bs4 import BeautifulSoup as soup
import re
import codecs
# Read the local HTML file
f = codecs.open("C:\\x\\test.html", 'r')
page = f.read()
f.close()
#html parsing
page_soup = soup(page,"html.parser")
tree = html.fromstring(page)
# Extract the report name
ref = page_soup.find("h1",{"id": False}).text.strip()
print("the name of the report is : " + ref + " \n")
chapters = page_soup.findAll('h1', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(chapters)) + " chapter(s)")
for index, chapter in enumerate(chapters):
    print(str(index+1) +"-" + str(chapter.text.strip()) + "\n")
sub_chapters = page_soup.findAll('h2', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(sub_chapters)) + " sub_chapter(s)")
for index, sub_chapter in enumerate(sub_chapters):
    print(str(index+1) +"-" +str(sub_chapter.text.strip()) + "\n")
With this code I get all the chapters and then all the sub-chapters as two separate flat lists, but that is not what I want.
I want the table of contents to be nested like this:
1-First Chapter
    1-First sub-chapter of the first chapter
    2-Second sub-chapter of the first chapter
2-Second Chapter
    1-First sub-chapter of the Second chapter
    2-Second sub-chapter of the Second chapter
Any recommendations or ideas on how to achieve this format?
You can use itertools.groupby after collecting the chapter and sub-chapter headings in document order:
from itertools import groupby, count
import re
from bs4 import BeautifulSoup as soup
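# content is assumed to hold the HTML source as a string (e.g. the page variable from the question)
data = [[i.name, re.sub(r'\s+$', '', i.text)] for i in soup(content, 'html.parser').find_all(re.compile('h1|h2'), {'id': re.compile(r'^\d+$')})]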
grouped, _count = [[a, list(b)] for a, b in groupby(data, key=lambda x:x[0] == 'h1')], count(1)
new_grouped = [[grouped[i][-1][0][-1], [c for _, c in grouped[i+1][-1]]] for i in range(0, len(grouped), 2)]
final_string = '\n'.join(f'{next(_count)}-{a}\n'+'\n'.join(f'\t{i}-{c}' for i, c in enumerate(b, 1)) for a, b in new_grouped)
print(final_string)
Output:
1-First Chapter
	1- First Sub-chapter of the first chapter
	2-Second Sub-chapter of the first chapter
2-Second Chapter
	1-First Sub-chapter of the Second chapter
	2-Second Sub-chapter of the Second chapter
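Note that the groupby version pairs each h1 group with the h2 group that follows it, so it relies on every chapter having at least one sub-chapter. A chapter with none would either raise an IndexError (if it comes last) or be merged with the next chapter. If that can happen in your reports, a plain loop over the same (tag, text) pairs may be easier to reason about. Below is a minimal sketch of that idea. build_toc is a hypothetical helper name, and the headings list is hard-coded here on the assumption that it has the same shape as data above. It also strips whitespace on both sides, so the stray space after "1-" disappears.

```python
def build_toc(headings):
    """Format (tag, text) pairs such as ("h1", "First Chapter") as a nested list."""
    lines = []
    chapter_no = 0
    sub_no = 0
    for tag, text in headings:
        text = text.strip()  # drop leading and trailing whitespace
        if tag == "h1":
            chapter_no += 1
            sub_no = 0  # restart sub-chapter numbering for each chapter
            lines.append(f"{chapter_no}-{text}")
        elif tag == "h2":
            sub_no += 1
            lines.append(f"\t{sub_no}-{text}")
    return "\n".join(lines)


# Hard-coded stand-in for the data list built with BeautifulSoup above
headings = [
    ("h1", "First Chapter "),
    ("h2", " First Sub-chapter of the first chapter"),
    ("h2", "Second Sub-chapter of the first chapter "),
    ("h1", "Second Chapter "),
    ("h2", "First Sub-chapter of the Second chapter "),
    ("h2", "Second Sub-chapter of the Second chapter "),
]
print(build_toc(headings))
```

With the real page you would pass data (or a list of tuples built the same way) instead of the hard-coded list.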