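I am trying to clean up an HTML file. It contains many single-row tables that are really bullet points, and I'd like to convert them into "ul"/"li" lists built from the table contents.
I have tried finding all "table" tags and replacing each one with an "li" (see code below), but I cannot work out how to wrap each run of "li" elements in a "ul". The input looks like this: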
<p> Hello world!</p>
<table><tr><td> </td><td>•</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Second</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Third</p></td></tr></table>
<table><tr><td> </td><td">•</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table> </td><td>•</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>
I have done the following:
def replaceBullets(soup):
    if soup.find('table'):
        for table in soup.findAll('table'):
            if isUnordered(table.text):
                replacement = soup.new_tag("li")
                replacement.string = table.p.text
                table.replace_with(replacement)

def isUnordered(line):
    if u'\u2022' in line and u'\xa0' in line:
        return True
    return False
I would like to get:
<p>Hello world!</p>
<ul><li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li></ul>
<p>Some paragraph</p>
<ul><li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li></ul>
<p>Another paragraph</p>
but I cannot find a way to insert the "ul" tags.
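Wow, it's been a cumbersome task, but I've finally managed to do it. I used find_all with a filter function to find the <p> elements inside each table, replaced each table with its <p> renamed to <li>, and then moved the <li>s that follow each top-level <p> into a new <ul>. Filter functions are described here: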
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
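As a minimal sketch of what "a filter function" means here: find_all accepts a function that takes a tag and returns True for the tags you want to keep. The sample markup and the name is_p_inside_table below are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
sample = "<p>outside</p><table><tr><td><p>inside</p></td></tr></table>"
soup = BeautifulSoup(sample, "html.parser")

def is_p_inside_table(tag):
    # Called once per tag; keep only <p> tags that have a <table> ancestor.
    return tag.name == "p" and tag.find_parent("table") is not None

matches = [p.get_text() for p in soup.find_all(is_p_inside_table)]
print(matches)  # ['inside']
```

The top-level <p> is skipped because find_parent('table') returns None for it.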
Please note that I've fixed the malformed parts of the HTML you posted.
from bs4 import BeautifulSoup, Tag

if __name__ == "__main__":
    html = '''
<p>Hello world!</p>
<table><tr><td> </td><td>•</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Second</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Third</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table><tr><td> </td><td>•</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>
'''
    soup = BeautifulSoup(html, 'html.parser')

    # find all <p>s under a table and replace table with the <p> element
    def p_under_table_extractor(el: Tag):
        table_parent = el.find_parent('table')
        return el.name == 'p' and table_parent

    for p in soup.find_all(p_under_table_extractor):
        table_parent = p.find_parent('table')
        p.name = 'li'
        table_parent.replace_with(p)

    # the only <p>s are the root <p>s
    for p in soup.find_all('p'):
        # find all succeeding <li>s
        li_els = []
        for el in p.find_all_next():
            if el.name != 'li':
                break
            else:
                li_els.append(el)

        # put those <li>s inside a <ul>
        if li_els:
            ul = soup.new_tag('ul')
            for li in li_els:
                ul.append(li)
            # and put <ul> after the <p>
            p.insert_after(ul)

    print(soup.prettify())
which prints:
<p>Hello world!</p>
<ul>
<li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li>
</ul>
<p>Some paragraph</p>
<ul>
<li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li>
</ul>
<p>Another paragraph</p>
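One caveat: the grouping loop above relies on every list being preceded by a <p>. If that can't be guaranteed, a possible variation (a sketch, not part of the code above) is to group the <li>s by adjacency instead. The helper name wrap_li_runs and the sample markup are hypothetical, and the sketch assumes the tables have already been replaced by bare <li> elements.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def wrap_li_runs(soup):
    """Wrap each run of adjacent <li> siblings that has no list parent in a new <ul>."""
    for li in soup.find_all("li"):
        # Skip items already inside a list, including ones moved earlier in this loop.
        if li.parent is not None and li.parent.name in ("ul", "ol"):
            continue
        ul = soup.new_tag("ul")
        li.insert_before(ul)
        node = li
        while node is not None:
            nxt = node.next_sibling  # remember the sibling before moving the node
            if isinstance(node, Tag) and node.name == "li":
                ul.append(node)      # move the <li> into the new <ul>
            elif isinstance(node, NavigableString) and not node.strip():
                pass                 # whitespace between items: leave it where it is
            else:
                break                # anything else ends the run
            node = nxt
    return soup

# Hypothetical input: bare <li>s, one run not preceded by any <p>.
sample = "<li>0</li>\n<p>a</p>\n<li>1</li>\n<li>2</li>\n<p>b</p>\n<li>3</li>"
soup = wrap_li_runs(BeautifulSoup(sample, "html.parser"))
groups = [[li.get_text() for li in ul.find_all("li")] for ul in soup.find_all("ul")]
print(groups)  # [['0'], ['1', '2'], ['3']]
```

Because the run is detected from the <li>s themselves, a list at the very start of the document or after a non-<p> element is still wrapped.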