I'd like to get the content of an <a>
tag using BeautifulSoup (version 4.12.3) in Python.
I have this code and HTML example:
h = """
<a id="0">
<table>
<thead>
<tr>
<th scope="col">Person</th>
<th scope="col">Most interest in</th>
<th scope="col">Age</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Chris</th>
<td>HTML tables</td>
<td>22</td>
</tr>
</table>
</a>
"""
test = bs4.BeautifulSoup(h)
test.find('a') # find_all, select => same results
But it only returns:
<a id="0">
</a>
I would expect that the content inside <table>
would appear between <a>
tags.
(I don't know whether it is common to wrap a table inside an <a>
tag, but the HTML I am trying to read is structured this way.)
I need to parse the table content from the <a>
tag since I need to link the id="0"
to the content of the table.
How can I achieve that?
How can I get the <a>
tag content, including the <table>
tag?
Specify the parser explicitly (use html.parser
). By default it uses the "best" parser available, presumably lxml
, which doesn't parse this document well:
import bs4
h = """
<a id="0">
<table>
<thead>
<tr>
<th scope="col">Person</th>
<th scope="col">Most interest in</th>
<th scope="col">Age</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Chris</th>
<td>HTML tables</td>
<td>22</td>
</tr>
</table>
</a>
"""
test = bs4.BeautifulSoup(h, "html.parser") # <-- define parser here
out = test.find("a")
print(out)
Prints:
<a id="0">
<table>
<thead>
<tr>
<th scope="col">Person</th>
<th scope="col">Most interest in</th>
<th scope="col">Age</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Chris</th>
<td>HTML tables</td>
<td>22</td>
</tr>
</tbody></table>
</a>
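Since the goal is to link id="0" to the table content, here is a minimal sketch of one way to do that once the <a> tag is parsed with html.parser. The names tables_by_id, headers and rows are illustrative only, and the sketch assumes the first <tr> of each table holds the column headers:

```python
from bs4 import BeautifulSoup

h = """
<a id="0">
<table>
<thead>
<tr>
<th scope="col">Person</th>
<th scope="col">Most interest in</th>
<th scope="col">Age</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Chris</th>
<td>HTML tables</td>
<td>22</td>
</tr>
</table>
</a>
"""

soup = BeautifulSoup(h, "html.parser")

# Map each <a id="..."> to the rows of the table it wraps (illustrative structure).
tables_by_id = {}
for a in soup.find_all("a", id=True):
    table = a.find("table")
    if table is None:
        continue  # this <a> does not wrap a table
    trs = table.find_all("tr")
    if not trs:
        continue  # empty table, nothing to extract
    # Assumption: the first row contains the column headers.
    headers = [c.get_text(strip=True) for c in trs[0].find_all(["th", "td"])]
    rows = []
    for tr in trs[1:]:
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        rows.append(dict(zip(headers, cells)))
    tables_by_id[a["id"]] = rows

print(tables_by_id)
```

Each key of tables_by_id is the id attribute of an <a> tag, and each value is a list of dictionaries, one per data row, keyed by column header.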