Tags: python, html, parsing, beautifulsoup, html-parsing

Replacing table content using beautifulsoup


I want to parse an HTML document that also contains tabular data using Beautiful Soup, and then run some NLP over the extracted text.

The table cells might contain just numbers or might be text-heavy. So before calling soup.get_text(), I want to change the contents of the table cells according to the following condition.

Condition: if a cell has more than two words (a number counts as one word), keep it; otherwise, replace the cell's contents with an empty string.
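The condition can be written as a small predicate. This is a minimal sketch: the function name `has_more_than_two_words` is made up for illustration, and it assumes "word" means any whitespace-separated token. It uses `str.split()` with no argument, because `split(' ')` produces empty strings for consecutive spaces and would overcount.

```python
def has_more_than_two_words(cell_text):
    """Return True if the text has at least three whitespace-separated tokens."""
    # split() with no argument splits on any run of whitespace
    # and never yields empty strings.
    return len(cell_text.split()) > 2


print(has_more_than_two_words("Row 2, Column 1, should be kept"))  # True
print(has_more_than_two_words("not kept"))                         # False
print(has_more_than_two_words("42"))                               # False

# Two tokens separated by a double space: split(' ') reports three pieces.
print(len("12  34".split(' ')))  # 3
print(len("12  34".split()))     # 2
```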

    soup = BeautifulSoup(html)
    <code to change table data based on condition>
    text = soup.get_text()

Here is what I have tried:

    tables = soup.find_all('table')
    for table in tables:
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            for ele in cols:
                if len(ele.text.split(' ')) < 3:
                    ele.text = ''

However, ele.text is a read-only property, so assigning to it raises an AttributeError.
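The failure can be reproduced in isolation. The snippet below is a small sketch on a one-cell fragment (the fragment itself is invented for illustration): assigning to `.text` fails, while `.string` can be assigned to.

```python
from bs4 import BeautifulSoup

cell = BeautifulSoup("<td>not kept</td>", "html.parser").td

error = None
try:
    cell.text = ""  # .text is a read-only property, so this fails
except AttributeError as exc:
    error = exc

print(type(error).__name__)  # AttributeError

# .string, by contrast, has a setter that replaces the tag's contents.
cell.string = ""
print(cell)  # <td></td>
```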

Here is a simple HTML structure with a table:

<!DOCTYPE html>
<html>

   <head>
      <title>HTML Tables</title>
   </head>

   <body>
      <table border = "1">
         <tr>
            <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
            <td><p><span>not kept</span></p></td>
         </tr>

         <tr>
            <td><p><span>Row 2, Column 1, should be kept</span></p></td>
            <td><p><span>Row 2, Column 2, should be kept</span></p></td>
         </tr>
      </table>

   </body>
</html>

Solution

  • Once you have found the element, use ele.string.replace_with("") to blank its text.

    Based on your sample HTML:

    from bs4 import BeautifulSoup

    html='''<html>
    
       <head>
          <title>HTML Tables</title>
       </head>
    
       <body>
          <table border = "1">
             <tr>
                <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
                <td><p><span>not kept</span></p></td>
             </tr>
    
             <tr>
                <td><p><span>Row 2, Column 1, should be kept</span></p></td>
                <td><p><span>Row 2, Column 2, should be kept</span></p></td>
             </tr>
          </table>
    
       </body>
    </html>'''
    
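    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        rows = table.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            for ele in cols:
                # Blank the cell if it has fewer than three whitespace-separated words
                if len(ele.text.split()) < 3:
                    # .string is the cell's single text node; replace it in place
                    ele.string.replace_with("")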
    
    print(soup)
    

    Output:

    <html>
    <head>
    <title>HTML Tables</title>
    </head>
    <body>
    <table border="1">
    <tr>
    <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
    <td><p><span></span></p></td>
    </tr>
    <tr>
    <td><p><span>Row 2, Column 1, should be kept</span></p></td>
    <td><p><span>Row 2, Column 2, should be kept</span></p></td>
    </tr>
    </table>
    </body>
    </html>
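
One caveat with `ele.string`: it is `None` when a tag has more than one child (for example `<td>12 <b>34</b></td>`) or no children at all, so calling `replace_with` on it would raise an AttributeError for such cells. A minimal sketch that avoids this by using `Tag.clear()` instead is shown here; the helper name `blank_short_cells` and its `min_words` parameter are hypothetical, and only `<td>` cells are handled.

```python
from bs4 import BeautifulSoup


def blank_short_cells(soup, min_words=3):
    """Empty every <td> whose text has fewer than min_words tokens."""
    for cell in soup.find_all("td"):
        if len(cell.get_text().split()) < min_words:
            # clear() removes all children, so it works regardless of
            # how many nested tags or text nodes the cell contains.
            cell.clear()
    return soup


html = (
    "<table><tr>"
    "<td><p>one two three four</p></td>"
    "<td>12 <b>34</b></td>"
    "<td></td>"
    "</tr></table>"
)
soup = blank_short_cells(BeautifulSoup(html, "html.parser"))
text = soup.get_text()
print(text)  # one two three four
```

Unlike `replace_with("")`, `clear()` also drops any nested `<p>` or `<span>` tags inside a blanked cell, leaving `<td></td>`. For the purpose of `get_text()` the result is the same.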