Tags: python, web-scraping, beautifulsoup

Using a nested element's text as a selector in BeautifulSoup


I'm looking to scrape the following HTML structure:

<p><strong>ID:</strong>547</p>
<p><strong>Class:</strong>foobar</p>
<p><strong>Procedures:</strong>lorem ipsum.</p>
<p>dolor sit amet.</p>
...
<p><strong>Description:</strong>curabitur at orci posuere.</p>
<p>massa nec fringilla.</p>
...

I'm not very confident with BeautifulSoup, and I'm not sure how to handle the fact that the identifier for each section (ID, class, procedures and description) is nested inside the first paragraph of that section's content.

I'm trying to get somewhere along the lines of the following:

{
    'id': 547,
    'class': 'foobar',
    'procedures': 'lorem ipsum. dolor sit amet.',
    'description': 'curabitur at orci posuere. massa nec fringilla.'
}

Solution

  • You can use the .next_sibling attribute of the <strong> element to get the text that follows it. A <p> tag without a <strong> tag continues the previous section, so its text has to be appended to the value stored under the last processed key.

    Use the find_all() method to select all <p> tags, then loop over them and update a dictionary:

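    mapping = {}
    key = None
    for item in soup.find_all('p'):
        if item.strong:
            # A labelled paragraph starts a new section: the label is the key,
            # and the text node right after <strong> starts the value.
            key = item.strong.get_text(strip=True).rstrip(':')
            value = item.strong.next_sibling.strip()
        else:
            # An unlabelled paragraph continues the current section.
            value = mapping[key] + ' ' + item.get_text(strip=True)
        mapping[key] = value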
    

    Demo:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup('''\
    ... <p><strong>ID:</strong>547</p>
    ... <p><strong>Class:</strong>foobar</p>
    ... <p><strong>Procedures:</strong>lorem ipsum.</p>
    ... <p>dolor sit amet.</p>
    ... ...
    ... <p><strong>Description:</strong>curabitur at orci posuere.</p>
    ... <p>massa nec fringilla.</p>
    ... ''')
    >>> mapping = {}
    >>> key = None
    >>> for item in soup.find_all('p'):
    ...     if item.strong:
    ...         key = item.strong.get_text(strip=True).rstrip(':')
    ...         value = item.strong.next_sibling.strip()
    ...     else:
    ...         value = mapping[key] + ' ' + item.get_text(strip=True)
    ...     mapping[key] = value
    ... 
    >>> from pprint import pprint
    >>> pprint(mapping)
    {u'Class': u'foobar',
     u'Description': u'curabitur at orci posuere. massa nec fringilla.',
     u'ID': u'547',
     u'Procedures': u'lorem ipsum. dolor sit amet.'}
    

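    This doesn't convert the ID to an integer. If you want strings that represent integers converted, wrap the conversion in a try/except: attempt value = int(value) and ignore the ValueError raised for non-numeric text. The keys also keep their original capitalisation ('ID', not 'id'); call .lower() on the key if you need lowercase keys.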
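For the exact dictionary shape asked for in the question (lowercase keys, integer ID), the loop above could be wrapped up roughly as follows. This is only a sketch: the names `parse_sections` and `coerce_int` are made up for illustration, the built-in `html.parser` backend is an assumption, and the guards for a missing text node or a leading unlabelled paragraph are defensive additions that the original loop does not have.

```python
from bs4 import BeautifulSoup, NavigableString


def coerce_int(text):
    """Return int(text) if text represents an integer, else text unchanged."""
    try:
        return int(text)
    except ValueError:
        return text


def parse_sections(html):
    """Collect <p> sections keyed by the lowercased <strong> label.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    key = None
    for item in soup.find_all("p"):
        if item.strong:
            key = item.strong.get_text(strip=True).rstrip(":").lower()
            sibling = item.strong.next_sibling
            # next_sibling is None when nothing follows <strong>, and it can
            # be another tag; only a plain text node is used as the value.
            value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        elif key is None:
            # Unlabelled paragraph before any section: nothing to append to.
            continue
        else:
            value = (mapping[key] + " " + item.get_text(strip=True)).strip()
        mapping[key] = value
    # Convert integer-looking values only after all paragraphs are joined.
    return {k: coerce_int(v) for k, v in mapping.items()}
```

Calling `parse_sections()` on the HTML from the question should give `'id'` as the integer 547 and leave the other values as strings. Converting at the end, rather than inside the loop, keeps the string concatenation in the continuation branch from ever seeing an integer.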