I am receiving a table in HTML and need to iterate through it to find a tag with rowspan set. Once I find a cell with rowspan=<a number>, I need to insert a block of code:
<tr>
<th rowspan="14" >Words</th>
<td style="height: 30px;"></td>
<td style="text-align: center; height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="text-align: right; padding: 7px; min-width: 75px"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
</tr>
as the row above the current row. Then, I need to remove this <th> from the current row.
For example, this is the code I would be searching through:
<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
<tr>
<th>Records</th>
<th>Worth</th>
<td>30</td>
<td>is</td>
<td>50</td>
<td>0</td>
<td>good</td>
<td></td>
</tr>
<tr>
<!-- this is the code im looking for -->
<th rowspan="13" valign="top">Reports</th>
<!-- -->
<th>Worth</th>
<td>30</td>
<td>=</td>
<td>40</td>
<td>0</td>
<td>bad</td>
<td></td>
</tr>
<tr>
<th>Worth</th>
<td>is</td>
<td>44</td>
<td>400.0</td>
<td></td>
<td>bad</td>
<td></td>
</tr>
</tbody>
</table>
So, once I find the <th> with rowspan, I need to insert the block as the row above it and then remove the <th> from the current row. Here's how I'm doing it now:
for child in soup.tbody.descendants:
    if child.name == 'th':
        if 'rowspan' in child.attrs:
            new_row = <<that block from above>>
            crazy_tag = bs4.BeautifulSoup(new_row, 'html.parser')
            x = child.find_previous('tr')
            x.insert_before(crazy_tag)
            child.extract()
The output I am looking for is this:
<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
<tr>
<th>Records</th>
<th>Worth</th>
<td>30</td>
<td>is</td>
<td>50</td>
<td>0</td>
<td>good</td>
<td></td>
</tr>
<tr>
<th rowspan="14" >Words</th>
<td style="height: 30px;"></td>
<td style="text-align: center; height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="text-align: right; padding: 7px; min-width: 75px"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
</tr>
<tr>
<th>Worth</th>
<td>30</td>
<td>=</td>
<td>40</td>
<td>0</td>
<td>bad</td>
<td></td>
</tr>
<tr>
<th>Worth</th>
<td>is</td>
<td>44</td>
<td>400.0</td>
<td></td>
<td>bad</td>
<td></td>
</tr>
</tbody>
</table>
The good news is, my code does what I want, and I get the desired output. The bad news is, there are other things I have to do to this HTML before I'm done. After this operation, the loop continues through the descendants and the next iteration gives me None. I thought extract() kept the structure of the tree intact, but it seems that either the block I am inserting or the element I am removing is not preserving the tree structure. Any ideas?
My question basically boils down to: how do I insert some HTML into a Beautiful Soup object and extract an element without ruining the sibling relationships in the document?
Instead of .insert_before()/.extract(), you can simply use .replace_with():
from bs4 import BeautifulSoup
html_text = """\
<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
<tr>
<th>Records</th>
<th>Worth</th>
<td>30</td>
<td>is</td>
<td>50</td>
<td>0</td>
<td>good</td>
<td></td>
</tr>
<tr>
<!-- this is the code im looking for -->
<th rowspan="13" valign="top">Reports</th>
<!-- -->
<th>Worth</th>
<td>30</td>
<td>=</td>
<td>40</td>
<td>0</td>
<td>bad</td>
<td></td>
</tr>
<tr>
<th>Worth</th>
<td>is</td>
<td>44</td>
<td>400.0</td>
<td></td>
<td>bad</td>
<td></td>
</tr>
</tbody>
</table>"""
snippet = """\
<tr>
<th rowspan="14" >Words</th>
<td style="height: 30px;"></td>
<td style="text-align: center; height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="text-align: right; padding: 7px; min-width: 75px"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
</tr>"""
soup = BeautifulSoup(html_text, "html.parser")
for th in soup.select("th[rowspan]"):
    th.replace_with(BeautifulSoup(snippet, "html.parser"))
print(soup)
Prints:
<table border="1" class="dataframe" style="border: 1px solid grey">
<tbody>
<tr>
<th>Records</th>
<th>Worth</th>
<td>30</td>
<td>is</td>
<td>50</td>
<td>0</td>
<td>good</td>
<td></td>
</tr>
<tr>
<!-- this is the code im looking for -->
<tr>
<th rowspan="14">Words</th>
<td style="height: 30px;"></td>
<td style="text-align: center; height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="text-align: right; padding: 7px; min-width: 75px"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
<td style="height: 30px;"></td>
</tr>
<!-- -->
<th>Worth</th>
<td>30</td>
<td>=</td>
<td>40</td>
<td>0</td>
<td>bad</td>
<td></td>
</tr>
<tr>
<th>Worth</th>
<td>is</td>
<td>44</td>
<td>400.0</td>
<td></td>
<td>bad</td>
<td></td>
</tr>
</tbody>
</table>
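Note that in the output above the new <tr> ends up nested inside the existing row, in the spot where the <th> used to be, rather than as a separate row above it as the question asked. The None in the original loop most likely comes from changing the tree while .descendants is still walking it: .descendants is a lazy generator that follows each node's next_element link, and extract() clears those links on the removed element. Below is a minimal sketch of one way around both problems, assuming the matches can be collected up front (select() returns a list) and edited afterwards. The HTML and the snippet are trimmed versions of the ones above, used only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the table from the question (illustrative only).
html_text = """<table><tbody>
<tr><th>Records</th><th>Worth</th><td>30</td></tr>
<tr><th rowspan="13" valign="top">Reports</th><th>Worth</th><td>40</td></tr>
<tr><th>Worth</th><td>44</td></tr>
</tbody></table>"""

# Trimmed-down version of the row to insert (illustrative only).
snippet = '<tr><th rowspan="14">Words</th><td style="height: 30px;"></td></tr>'

soup = BeautifulSoup(html_text, "html.parser")

# select() returns a list, so every match is found before the tree is edited.
# The freshly inserted <th rowspan="14"> is therefore never visited.
for th in soup.select("th[rowspan]"):
    row = th.find_parent("tr")                            # row that contains the match
    new_row = BeautifulSoup(snippet, "html.parser").tr    # parse a fresh copy of the row
    row.insert_before(new_row)                            # place it above the current row
    th.extract()                                          # then drop the original <th>

print(soup)
```

Using find_parent("tr") instead of find_previous("tr") is a deliberate choice here: it always returns the row that actually contains the matched <th>. If further processing needs to walk the tree, doing it in a separate pass after this loop avoids iterating over a structure that is being modified.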