How to insert and remove tags while maintaining siblings in BeautifulSoup?


I am receiving a table in HTML and need to iterate through it to find a cell with rowspan set. Once I find a cell with rowspan=<a number>, I need to insert this block of code:

<tr>
<th rowspan="14" >Words</th>
<td style="height: 30px;"></td>
<td style="text-align: center; height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="text-align: right; padding: 7px; min-width: 75px"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
</tr>\n

as the row above the current row. Then I need to remove the <th> that has the rowspan from the current row.

For example, this is the code I would be searching through:

<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
    <tr>
      <th>Records</th>
      <th>Worth</th>
      <td>30</td>
      <td>is</td>
      <td>50</td>
      <td>0</td>
      <td>good</td>
      <td></td>
    </tr>
    <tr>
      <!-- this is the code im looking for -->
      <th rowspan="13" valign="top">Reports</th>
      <!--  -->
      <th>Worth</th>
      <td>30</td>
      <td>=</td>
      <td>40</td>
      <td>0</td>
      <td>bad</td>
      <td></td>
    </tr>
    <tr>
      <th>Worth</th>
      <td>is</td>
      <td>44</td>
      <td>400.0</td>
      <td></td>
      <td>bad</td>
      <td></td>
    </tr>
</tbody>
</table>

So, once I find the <th> with rowspan, I need to insert the block as the row above it and then remove the <th> from the current row. Here's how I'm doing it now:

for child in soup.tbody.descendants:
    if child.name == 'th':
        if 'rowspan' in child.attrs:
            new_row = <<that block from above>>
            crazy_tag = bs4.BeautifulSoup(new_row, 'html.parser')
            x = child.find_previous('tr')
            x.insert_before(crazy_tag)
            child.extract()

The output I am looking for is this:

<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
    <tr>
      <th>Records</th>
      <th>Worth</th>
      <td>30</td>
      <td>is</td>
      <td>50</td>
      <td>0</td>
      <td>good</td>
      <td></td>
    </tr>
    <tr>
      <th rowspan="14" >Words</th>
      <td style="height: 30px;"></td>
      <td style="text-align: center; height: 30px;"></td>
      <td style="height: 30px;"></td>
      <td style="text-align: right; padding: 7px; min-width: 75px"></td>
      <td style="height: 30px;"></td>
      <td style="height: 30px;"></td>
      <td style="height: 30px;"></td>
    </tr>
    <tr>
      <th>Worth</th>
      <td>30</td>
      <td>=</td>
      <td>40</td>
      <td>0</td>
      <td>bad</td>
      <td></td>
    </tr>
    <tr>
      <th>Worth</th>
      <td>is</td>
      <td>44</td>
      <td>400.0</td>
      <td></td>
      <td>bad</td>
      <td></td>
    </tr>
</tbody>
</table>

The good news is that my code does what I want and I get the desired output. The bad news is that there are other things I have to do to this HTML before I'm done. After this operation, as the loop continues through the descendants, the next iteration gives me None. I thought extract() kept the structure of the tree intact, but it seems that either the block I am inserting or the tag I am removing is breaking the tree structure. Any ideas?

My question basically boils down to: how do I insert some HTML into a BeautifulSoup object and extract a tag without ruining the sibling relationships in the document?


Solution

  • Instead of .insert_before()/.extract(), you can use .replace_with(), which swaps the matching <th> for the snippet in place:

    from bs4 import BeautifulSoup
    
    html_text = """\
    <table border="1" class="dataframe" style="border: 1px solid grey">
    <tbody>
        <tr>
          <th>Records</th>
          <th>Worth</th>
          <td>30</td>
          <td>is</td>
          <td>50</td>
          <td>0</td>
          <td>good</td>
          <td></td>
        </tr>
        <tr>
          <!-- this is the code im looking for -->
          <th rowspan="13" valign="top">Reports</th>
          <!--  -->
          <th>Worth</th>
          <td>30</td>
          <td>=</td>
          <td>40</td>
          <td>0</td>
          <td>bad</td>
          <td></td>
        </tr>
        <tr>
          <th>Worth</th>
          <td>is</td>
          <td>44</td>
          <td>400.0</td>
          <td></td>
          <td>bad</td>
          <td></td>
        </tr>
    </tbody>
    </table>"""
    
    snippet = """\
    <tr>
    <th rowspan="14" >Words</th>
    <td style="height: 30px;"></td>
    <td style="text-align: center; height: 30px;"></td>
    <td style="height: 30px;"></td>
    <td style="text-align: right; padding: 7px; min-width: 75px"></td>
    <td style="height: 30px;"></td>
    <td style="height: 30px;"></td>
    <td style="height: 30px;"></td>
    </tr>"""
    
    soup = BeautifulSoup(html_text, "html.parser")
    
    for th in soup.select("th[rowspan]"):
        th.replace_with(BeautifulSoup(snippet, "html.parser"))
    
    print(soup)
    

    Prints:

    <table border="1" class="dataframe" style="border: 1px solid grey">
    <tbody>
    <tr>
    <th>Records</th>
    <th>Worth</th>
    <td>30</td>
    <td>is</td>
    <td>50</td>
    <td>0</td>
    <td>good</td>
    <td></td>
    </tr>
    <tr>
    <!-- this is the code im looking for -->
    <tr>
    <th rowspan="14">Words</th>
    <td style="height: 30px;"></td>
    <td style="text-align: center; height: 30px;"></td>
    <td style="height: 30px;"></td>
    <td style="text-align: right; padding: 7px; min-width: 75px"></td>
    <td style="height: 30px;"></td>
    <td style="height: 30px;"></td>
    <td style="height: 30px;"></td>
    </tr>
    <!-- -->
    <th>Worth</th>
    <td>30</td>
    <td>=</td>
    <td>40</td>
    <td>0</td>
    <td>bad</td>
    <td></td>
    </tr>
    <tr>
    <th>Worth</th>
    <td>is</td>
    <td>44</td>
    <td>400.0</td>
    <td></td>
    <td>bad</td>
    <td></td>
    </tr>
    </tbody>
    </table>
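
  • Note that in the output above the new <tr> ends up nested inside the existing row, which is not quite the layout requested in the question. If the new row has to sit above the current one, a minimal sketch (not the method used above) is to collect the matches first and only then modify the tree. The None seen in the question most likely comes from changing the tree while iterating .descendants: that generator follows .next_element links, and extract() resets those links on the removed tag. soup.select() hands back all matches up front, so the loop no longer depends on those links. The shortened html_text and snippet here are stand-ins for the full markup:

```python
from bs4 import BeautifulSoup

# Shortened stand-ins for the table and the row to insert (assumed for brevity).
html_text = """<table><tbody>
<tr><th>Records</th><th>Worth</th><td>30</td></tr>
<tr><th rowspan="13" valign="top">Reports</th><th>Worth</th><td>40</td></tr>
<tr><th>Worth</th><td>44</td></tr>
</tbody></table>"""

snippet = '<tr><th rowspan="14">Words</th><td style="height: 30px;"></td></tr>'

soup = BeautifulSoup(html_text, "html.parser")

# select() returns every match before the loop starts, so inserting and
# removing tags inside the loop does not disturb the iteration.
for th in soup.select("th[rowspan]"):
    row = th.find_parent("tr")                          # row that holds the cell
    new_row = BeautifulSoup(snippet, "html.parser").tr  # fresh <tr> tag each time
    row.insert_before(new_row)                          # new row goes above it
    th.extract()                                        # drop the rowspan cell

rows = soup.find_all("tr")
print(soup)
```

    Because the inserted row is a fresh Tag taken from its own small soup, the rowspan cell it contains is not part of the list being looped over, so it is left alone. Any further processing can then run over soup.find_all("tr") or another fresh query rather than a generator started before the edits.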