Tags: python, html, beautifulsoup, css-tables

How can I transform multiple tables to an unordered list of items, where each table is a <li>?


I am trying to fix an HTML file. It contains multiple single-row tables that are used as bullet points, and I'd like to convert each run of tables into a "ul" whose "li" items hold the table contents.

I have tried finding all "table" tags and replacing them with "li" (see code below), but I cannot wrap a "ul" around each list:

<p> Hello world!</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Second</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Third</p></td></tr></table>
<table><tr><td>&nbsp;</td><td">&bull;</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table>&nbsp;</td><td>&bull;</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>

I have done the following:

def replaceBullets(soup):
    if soup.find('table'):
        for table in soup.findAll('table'):
            if isUnordered(table.text):
                replacement = soup.new_tag("li")
                replacement.string = table.p.text
                table.replace_with(replacement)

def isUnordered(line):
    if u'\u2022' in line and u'\xa0' in line:
        return True
    return False

I would like to get:

<p>Hello world!</p>
<ul><li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li></ul>
<p>Some paragraph</p>
<ul><li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li></ul>
<p>Another paragraph</p>

but I cannot find a way to insert the "ul" tag around each group of "li" items.


Solution

  • It was a cumbersome task, but I finally managed to do it. I used find_all with a filter function to match the <p> elements that sit inside a table.

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function

    Please note that I've fixed the malformed parts of the HTML you posted.

    from bs4 import BeautifulSoup, Tag
    
    if __name__ == "__main__":
    
        html = '''
        <p>Hello world!</p>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>First bullet point text</p></td></tr></table>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Second</p></td></tr></table>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Third</p></td></tr></table>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Last</p></td></tr></table>
    <p>Some paragraph</p>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>1st item of 2nd list</p></td></tr></table>
    <table><tr><td>&nbsp;</td><td>&bull;</td><td><p>2nd item of 2nd list</p></td></tr></table>
    <p>Another paragraph</p>
        '''
    
        soup = BeautifulSoup(html, 'html.parser')
    
        # filter function: match only <p> elements that have a <table> ancestor
        def p_under_table_extractor(el: Tag) -> bool:
            return el.name == 'p' and el.find_parent('table') is not None
    
        for p in soup.find_all(p_under_table_extractor):
            table_parent = p.find_parent('table')
            p.name = 'li'
            table_parent.replace_with(p)
    
        # the <p>s inside tables are now <li>s, so the remaining <p>s are the top-level paragraphs
        for p in soup.find_all('p'):
            # find all succeeding <li>s
            li_els = []
            for el in p.find_all_next():
                if el.name != 'li':
                    break
                else:
                    li_els.append(el)
            # put those <li>s inside a <ul>
            if li_els:
                ul = soup.new_tag('ul')
                for li in li_els:
                    ul.append(li)
                # and put <ul> after the <p>
                p.insert_after(ul)
    
        print(soup.prettify())
    
    

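    which prints the following (prettify() puts each tag and string on its own line; the output is condensed here for readability):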

    <p>Hello world!</p>
    <ul>
        <li>First bullet point text</li>
        <li>Second</li>
        <li>Third</li>
        <li>Last</li>
    </ul>
    <p>Some paragraph</p>
    <ul>
        <li>1st item of 2nd list</li>
        <li>2nd item of 2nd list</li>
    </ul>
    <p>Another paragraph</p>
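
The loop above only groups <li> items that come right after a <p>, so a run of bullet tables at the very start of the document would stay unwrapped. As a rough sketch of a more general variant, the code below wraps every run of adjacent <li> siblings in its own <ul>, whatever precedes it. The function name tables_to_lists and the sample input are made up for illustration, and the sketch assumes the input has no nested tables and that every bullet table contains a <p>.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def tables_to_lists(html: str) -> str:
    """Hypothetical helper: bullet-style tables -> <li>s grouped in <ul>s."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: replace each bullet table with an <li> holding its <p> text.
    for table in soup.find_all("table"):
        if "\u2022" in table.get_text() and table.p is not None:
            li = soup.new_tag("li")
            li.string = table.p.get_text(strip=True)
            table.replace_with(li)

    # Step 2: wrap every run of adjacent <li> siblings in one <ul>.
    for li in soup.find_all("li"):
        if li.parent is not None and li.parent.name in ("ul", "ol"):
            continue  # already inside a list (possibly moved there just now)
        ul = soup.new_tag("ul")
        li.insert_before(ul)
        node = li
        while node is not None:
            following = node.next_sibling  # remember it before moving node
            if isinstance(node, Tag) and node.name == "li":
                ul.append(node)  # append() moves the <li> into the <ul>
            elif isinstance(node, NavigableString) and not node.strip():
                node.extract()  # drop whitespace after a list item
            else:
                break  # anything else ends the current run
            node = following

    return str(soup)


if __name__ == "__main__":
    # Assumed sample input; the first list has no <p> in front of it.
    sample = (
        "<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>A</p></td></tr></table>\n"
        "<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>B</p></td></tr></table>\n"
        "<p>Middle</p>\n"
        "<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>C</p></td></tr></table>"
    )
    print(tables_to_lists(sample))
```

Walking the siblings rather than calling find_all_next() keeps the grouping local to one parent, so an <li> nested deeper in the tree is never pulled into the wrong list. The trade-off is that whitespace directly after a list item is discarded, including the line break after the last item of a run.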