Extracting table contents from HTML with Python and BeautifulSoup

I want to extract certain information from an HTML document. For example, it contains a table (among other tables with other contents) like this:

    <table class="details">
                    <td>Bug Fix Advisory</td>
                    <th>Issued on:</th>
                    <th>Last updated on:</th>

                    <th valign="top">Affected Products:</th>
                    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>


I want to extract information like the date of "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:

    from bs4 import BeautifulSoup
    if table_tag['class'] == ['details']:
            print + " " +
            print  unicode(a)
            print table_tag.contents

This gets me the contents of the first table row, and also a listing of the contents. But the next-sibling lookup is not working right; I guess I am just using it wrong. Of course I could parse the contents list myself, but it seems to me that BeautifulSoup was designed to spare us from doing exactly this (if I start parsing myself, I might as well parse the whole document). If someone could enlighten me on how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.


  • >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(unicodestring_containing_the_entire_html_doc, 'html.parser')
    >>> table = soup.find('table', {'class': 'details'})
    >>> th = table.find('th', text='Issued on:')
    >>> th
    <th>Issued on:</th>
    >>> td = th.find_next('td')
    >>> td
    >>> td.text
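
    If you need more than one field, the same lookup can be wrapped in a loop that pairs each label with its value. The following is only a sketch under some assumptions: the sample markup, the "Type:" label, and the dates are made-up placeholders (the real page will differ), the function name details_to_dict is hypothetical, and it assumes each th sits in the same row as the td holding its value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the "Type:" label and both dates are
# placeholders, not values taken from the original page.
SAMPLE_HTML = """
<table class="other"><tr><th>Issued on:</th><td>ignore me</td></tr></table>
<table class="details">
  <tr><th>Type:</th><td>Bug Fix Advisory</td></tr>
  <tr><th>Issued on:</th><td>2000-01-01</td></tr>
  <tr><th>Last updated on:</th><td>2000-01-02</td></tr>
</table>
"""


def details_to_dict(html):
    """Map each <th> label in the 'details' table to the text of its <td>."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the table of interest, so identical labels
    # in other tables are not picked up.
    table = soup.find("table", class_="details")
    if table is None:
        return {}
    result = {}
    for th in table.find_all("th"):
        # Assumption: the value cell is a sibling of the label in the same row.
        td = th.find_next_sibling("td")
        if td is not None:
            label = th.get_text(strip=True).rstrip(":")
            result[label] = td.get_text(strip=True)
    return result


details = details_to_dict(SAMPLE_HTML)
print(details.get("Issued on"))
```

    If the labels and values on your page live in different rows, find_next_sibling will return None for them; in that case th.find_next('td'), as in the session above, is the more forgiving choice.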