
Parse all item elements with children from an RSS feed with BeautifulSoup


From an RSS feed, how do you get a string of everything that's inside each item tag?

Example input (simplified):

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item>
  <title>Hello world1</title>
  <comments>Hi there</comments>
  <pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world2</title>
  <comments>Good afternoon</comments>
  <pubDate>Tue, 22 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world3</title>
  <comments>blue paint</comments>
  <pubDate>Tue, 23 Nov 2011 20:10:10 +0000</pubDate>
</item>
</channel>
</rss>

I need a Python function that takes this RSS file (I'm currently using BeautifulSoup) and loops over each item. On each iteration, I need a variable holding a string of everything inside that item.

Example first loop result:

<title>Hello world1</title>
<comments>Hi there</comments>
<pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>

This code gets me the first result, but how do I get all the next ones?

from bs4 import BeautifulSoup
html_data = BeautifulSoup(xml, 'xml')
print(html_data.channel.item)

Solution

  • Using BeautifulSoup 4:

    import bs4 as bs
    doc = bs.BeautifulSoup(xml, 'xml')  # the 'xml' parser requires lxml
    for item in doc.find_all('item'):
        for elt in item:
            # Skip whitespace-only text nodes; keep only child tags.
            if isinstance(elt, bs.Tag):
                print(elt)
    

    And here's how you could do the same thing with lxml:

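    import lxml.etree as ET
    # lxml rejects str input that carries an encoding declaration, so pass bytes.
    doc = ET.fromstring(xml.encode('utf-8') if isinstance(xml, str) else xml)
    for item in doc.xpath('//item'):
        for elt in item.xpath('descendant::*'):
            # encoding='unicode' returns str; with_tail=False drops trailing whitespace.
            print(ET.tostring(elt, encoding='unicode', with_tail=False))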
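
The question asks for a function that yields one string per item rather than printed tags. Here is a minimal sketch of that idea. It is not the answer's original method: it swaps in the standard library's xml.etree.ElementTree so it runs without third-party packages, and the names item_strings and sample are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def item_strings(xml_text):
    """Return one string per <item>, holding its serialized child elements.

    Hypothetical helper for illustration; assumes a well-formed RSS document.
    """
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter('item'):
        # Serialize each direct child; strip() drops the whitespace that
        # follows the closing tag (the element's "tail").
        parts = [ET.tostring(child, encoding='unicode').strip() for child in item]
        results.append('\n'.join(parts))
    return results


# Shortened version of the feed from the question (assumed sample data).
sample = """<rss version="2.0">
<channel>
<title>Test</title>
<item>
  <title>Hello world1</title>
  <comments>Hi there</comments>
</item>
<item>
  <title>Hello world2</title>
  <comments>Good afternoon</comments>
</item>
</channel>
</rss>"""

for text in item_strings(sample):
    print(text)
    print('---')
```

Each loop iteration receives the item's inner markup as a single string, and the channel-level title is skipped because only item elements are visited. The same structure should carry over to BeautifulSoup or lxml by replacing the parsing and serialization calls.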