python, html, beautifulsoup

Get an <a> tag content using BeautifulSoup


I'd like to get the content of an <a> tag using BeautifulSoup (version 4.12.3) in Python. I have this code and HTML example:

import bs4

h = """
<a id="0">
    <table> 
  <thead>
    <tr>
      <th scope="col">Person</th>
      <th scope="col">Most interest in</th>
      <th scope="col">Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">Chris</th>
      <td>HTML tables</td>
      <td>22</td>
    </tr>
    </table>
</a>
"""

test = bs4.BeautifulSoup(h)
test.find('a')  # find_all, select => same results

But it only returns:

<a id="0">
</a>

I would expect the content inside <table> to appear between the <a> tags. (I don't know whether it is common to wrap a table inside an <a> tag, but the HTML I'm trying to read is structured this way.)

I need to parse the table content from the <a> tag since I need to link the id="0" to the content of the table.

How can I achieve that? How can I get the <a> tag content, including the <table> tag?


Solution

  • Explicitly specify the parser you want to use (here, html.parser). By default, BeautifulSoup picks the "best" parser available. I presume that is lxml, which doesn't parse this document well:

    import bs4
    
    h = """
    <a id="0">
        <table> 
      <thead>
        <tr>
          <th scope="col">Person</th>
          <th scope="col">Most interest in</th>
          <th scope="col">Age</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">Chris</th>
          <td>HTML tables</td>
          <td>22</td>
        </tr>
        </table>
    </a>
    """
    
    test = bs4.BeautifulSoup(h, "html.parser")  # <-- define parser here
    out = test.find("a")
    
    print(out)
    

    Prints:

    <a id="0">
    <table>
    <thead>
    <tr>
    <th scope="col">Person</th>
    <th scope="col">Most interest in</th>
    <th scope="col">Age</th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <th scope="row">Chris</th>
    <td>HTML tables</td>
    <td>22</td>
    </tr>
    </tbody></table>
    </a>
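
Since the question mentions linking id="0" to the table content, here is a minimal sketch of one way that could be done once html.parser keeps the table inside the <a> tag. The name tables_by_id and the dictionary layout are assumptions made for illustration, and the sample HTML is the one from the question with a closing </tbody> added:

```python
import bs4

# Sample HTML from the question (closing </tbody> added for tidiness).
h = """
<a id="0">
  <table>
    <thead>
      <tr>
        <th scope="col">Person</th>
        <th scope="col">Most interest in</th>
        <th scope="col">Age</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <th scope="row">Chris</th>
        <td>HTML tables</td>
        <td>22</td>
      </tr>
    </tbody>
  </table>
</a>
"""

soup = bs4.BeautifulSoup(h, "html.parser")

# Hypothetical result structure: {anchor id: [one dict per body row]}.
tables_by_id = {}
for a in soup.find_all("a", id=True):
    table = a.find("table")
    if table is None:
        continue  # this <a> tag does not wrap a table

    # Column names come from the header cells.
    headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]

    # Each body row mixes <th scope="row"> and <td> cells, so collect both.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.tbody.find_all("tr")
    ]

    tables_by_id[a["id"]] = [dict(zip(headers, row)) for row in rows]

print(tables_by_id)
```

This sketch assumes every wrapped table has both a <thead> and a <tbody>. If the real documents omit either one, table.thead or table.tbody will be None and the lookup needs a guard.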