I am trying to create a program that prints out the first 5 jokes from /r/Jokes, but I am having some trouble formatting the output to look nice. I want it set out like this:
Post Title: Post Content
For example, here is one of the jokes directly from the RSS feed:
<item>
<title>What do you call a stack of pancakes?</title>
<link>https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</link>
<guid isPermaLink="true">https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</guid>
<pubDate>Sun, 30 Aug 2015 03:18:00 +0000</pubDate>
<description><!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div><!-- SC_ON --> submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a> <br/> <a href="http://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[link]</a> <a href="https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[2 comments]</a></description>
</item>
I am currently printing the title, followed by a colon and a space, and then the description. However, this prints all of the text, including the links, the author and all the HTML tags. How would I get just the text inside the paragraph tags?
Thanks,
EDIT: This is my code:
import time

import feedparser

d = feedparser.parse('https://www.reddit.com/r/cleanjokes/.rss')
print("")
print("Pulling latest jokes from Reddit. https://www.reddit.com/r/cleanjokes")
print("")
time.sleep(0.8)
print("Displaying First 5 Jokes:")
print("")
print(d['entries'][0]['title'] + ": " + d['entries'][0]['description'])
print(d['entries'][1]['title'] + ": " + d['entries'][1]['description'])
print(d['entries'][2]['title'] + ": " + d['entries'][2]['description'])
print(d['entries'][3]['title'] + ": " + d['entries'][3]['description'])
print(d['entries'][4]['title'] + ": " + d['entries'][4]['description'])
This just gets the first 5 entries. What I need to do is format the description string after the colon to only include the text inside the paragraph tags.
Oren is right about using BeautifulSoup, but I'll try to provide a more complete answer.
d['entries'][0]['description']
returns HTML, which you need to parse. BeautifulSoup (bs4) is a great library for that.
You can install it using:
pip install beautifulsoup4
Then parse the description and pull out the text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(d['entries'][0]['description'], 'html.parser')
print(soup.div.get_text())
This gets the text from the div part of the entry.
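To cover all five jokes in the "Post Title: Post Content" layout from the question, the same idea can be wrapped in a loop. This is only a sketch: the helper names paragraph_text and format_jokes are made up for illustration, and it assumes each description keeps the joke text inside <p> tags, as in the RSS item shown in the question. The sample entry is copied from that item so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup


def paragraph_text(description_html):
    """Return only the text inside the <p> tags of an entry's description.

    Hypothetical helper; assumes the joke body lives in <p> elements.
    """
    soup = BeautifulSoup(description_html, "html.parser")
    paragraphs = soup.find_all("p")
    # Join multi-paragraph jokes with a space and trim stray whitespace.
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs)


def format_jokes(entries, limit=5):
    """Build 'Post Title: Post Content' lines for the first `limit` entries."""
    return [
        entry["title"] + ": " + paragraph_text(entry["description"])
        for entry in entries[:limit]
    ]


# Sample entry taken from the RSS item in the question (trimmed after the author link).
sample_entries = [
    {
        "title": "What do you call a stack of pancakes?",
        "description": (
            '<!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div>'
            '<!-- SC_ON --> submitted by '
            '<a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a>'
        ),
    }
]

for line in format_jokes(sample_entries):
    print(line)
```

With the feed from the question, the call would presumably be format_jokes(d['entries']), which replaces the five separate print statements. Using find_all("p") rather than soup.div also means the "submitted by" text and the links are skipped even if they end up inside the div.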