To give my problem some context: I am writing a Django webapp made up of several applications. One of them displays articles from RSS feeds. So far I only display the link, source and description, and I want to add thumbnails to these articles. I'm trying to grab these thumbnails from any RSS or Atom feed, but some parts of these feeds (e.g. images) are structured in totally arbitrary ways. Since I don't want to write a specific script for every feed on the Web, my idea is to look for ".jpg" or ".png" substrings in every article I fetch and take that URL. Getting from RSS or Atom feeds to articles is handled well by the Python feedparser module, which outputs this, for example:
{'guidislink': False,
'href': '',
'id': 'http://www.bbc.co.uk/sport/football/39426760',
'link': 'http://www.bbc.co.uk/sport/football/39426760',
'links': [{'href': 'http://www.bbc.co.uk/sport/football/39426760',
'rel': 'alternate',
'type': 'text/html'}],
'media_thumbnail': [{'height': '576',
'url': 'http://c.files.bbci.co.uk/44A9/production/_95477571_joshking2.jpg',
'width': '1024'}],
'published': 'Wed, 05 Apr 2017 21:49:14 GMT',
'published_parsed': time.struct_time(tm_year=2017, tm_mon=4, tm_mday=5, tm_hour=21, tm_min=49, tm_sec=14, tm_wday=2, tm_yday=95, tm_isdst=0),
'summary': 'Joshua King scores a dramatic late equaliser for Bournemouth as '
'Liverpool drop two crucial points at Anfield.',
'summary_detail': {'base': 'http://feeds.bbci.co.uk/news/rss.xml',
'language': None,
'type': 'text/html',
'value': 'Joshua King scores a dramatic late equaliser '
'for Bournemouth as Liverpool drop two crucial '
'points at Anfield.'},
'title': 'Liverpool 2-2 Bournemouth',
'title_detail': {'base': 'http://feeds.bbci.co.uk/news/rss.xml',
'language': None,
'type': 'text/plain',
'value': 'Liverpool 2-2 Bournemouth'}}
Here, http://c.files.bbci.co.uk/44A9/production/_95477571_joshking2.jpg is nested somewhere in lists and dictionaries. While I know how to access it in this specific case, the structure of feeds varies widely. Mainly:
However, it is almost always the case that a URL with an image extension is the thumbnail of that article. How do I get that URL?
To frame it a little more: for now I use helper functions (based on the feedparser module) that produce a feeds context variable, which is a dictionary, usable in my templates. I do the looping and the display of title, description, etc. directly in my templates, since those fields are consistently part of that dictionary thanks to feedparser:
...
{% for feed in feeds %}
    <h3>{{ feed.feed.title }}</h3>
    {% for entry in feed.entries %}
...
On the backend:
import feedparser
from django.views import generic

# urls, RSSArticle and delete_existing_entries are defined elsewhere (not shown).

def parse_feeds(urls):
    """Parse every feed URL with feedparser and return the parsed feeds."""
    parsed_feeds = []
    for url in urls:
        parsed_feed = feedparser.parse(url)
        parsed_feeds.append(parsed_feed)
    return parsed_feeds

class IndexView(generic.ListView):
    template_name = 'publisher/index.html'

    def get_context_data(self, **kwargs):
        context = super(IndexView, self).get_context_data(**kwargs)
        reacted_feeds = RSSArticle.objects.all()
        context['reacted_feeds'] = reacted_feeds
        parsed_feeds = parse_feeds(urls)
        delete_existing_entries(parsed_feeds)
        context['feeds'] = parsed_feeds
        return context
So basically, every time IndexView is called, you get the list of all articles from the feeds you subscribed to. That's where I want to include the images, which feedparser does not expose in a consistent place because their location varies from feed to feed.
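As a rough sketch of that plumbing only (not of the image lookup itself), the parsed feeds could be annotated before they go into the context. The names attach_thumbnails and find_thumbnail and the 'thumbnail' key are hypothetical; find_thumbnail stands for whatever function ends up locating the image URL in an entry.

```python
def attach_thumbnails(parsed_feeds, find_thumbnail):
    """Store find_thumbnail(entry) under entry['thumbnail'] for every entry.

    parsed_feeds is assumed to be the list returned by parse_feeds();
    find_thumbnail is a hypothetical callable taking one entry and
    returning a URL or None.
    """
    for feed in parsed_feeds:
        for entry in feed.get('entries', []):
            entry['thumbnail'] = find_thumbnail(entry)
    return parsed_feeds
```

The template could then read the value as {{ entry.thumbnail }} inside the existing loop over feed.entries.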
If I want to include these pictures, at a macro level I basically have two solutions:
Maybe I should just keep the raw XML and try my luck with BeautifulSoup instead of translating it to a dictionary with feedparser.
PS: here is another example, where the image is located somewhere else:
{'guidislink': False,
'id': 'http://www.lemonde.fr/tiny/5106451/',
'link': 'http://www.lemonde.fr/les-decodeurs/article/2017/04/05/presidentielle-les-grands-clivages-qui-divisent-les-onze-candidats_5106451_4355770.html?xtor=RSS-3208',
'links': [{'href': 'http://www.lemonde.fr/les-decodeurs/article/2017/04/05/presidentielle-les-grands-clivages-qui-divisent-les-onze-candidats_5106451_4355770.html?xtor=RSS-3208',
'rel': 'alternate',
'type': 'text/html'},
{'href': 'http://s1.lemde.fr/image/2017/04/05/644x322/5106578_3_0f2b_sur-le-plateau-du-debat-de-bfmtv-et-cnews_0e90a3db44861847870cfa1e4c3793b1.jpg',
'length': '40057',
'rel': 'enclosure',
'type': 'image/jpeg'}],
'published': 'Wed, 05 Apr 2017 17:02:38 +0200',
'published_parsed': time.struct_time(tm_year=2017, tm_mon=4, tm_mday=5, tm_hour=15, tm_min=2, tm_sec=38, tm_wday=2, tm_yday=95, tm_isdst=0),
'summary': 'Protection sociale, Europe, identité… Avec leurs programmes, les '
'proximités idéologiques entre candidats bousculent de plus en '
'plus le traditionnel axe «\xa0gauche-droite\xa0».',
'summary_detail': {'base': 'http://www.lemonde.fr/rss/une.xml',
'language': None,
'type': 'text/html',
'value': 'Protection sociale, Europe, identité… Avec leurs '
'programmes, les proximités idéologiques entre '
'candidats bousculent de plus en plus le '
'traditionnel axe «\xa0gauche-droite\xa0».'},
'title': 'Présidentielle\xa0: les grands clivages qui divisent les onze '
'candidats',
'title_detail': {'base': 'http://www.lemonde.fr/rss/une.xml',
'language': None,
'type': 'text/plain',
'value': 'Présidentielle\xa0: les grands clivages qui '
'divisent les onze candidats'}}
I wrote a solution based on this snippet.
def get_image_url(substring, node):
    """Yield every string containing substring found in nested dicts/lists (Python 3)."""
    if isinstance(node, str):
        if substring in node:
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from get_image_url(substring, value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from get_image_url(substring, item)
    # Any other type (booleans, None, numbers) cannot hold a URL and is skipped,
    # so no blanket try/except is needed.
>>> list(get_image_url('.jpg', article_dict))
['https://static01.nyt.com/images/2017/04/09/us/10OBAMA-alt/10OBAMA-alt-moth.jpg']
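A plain substring test also matches any text that merely mentions ".jpg", so a slightly stricter variant may be worth considering. The sketch below is only one possible arrangement, not a definitive implementation: it first checks the two places seen in the example dumps above (media_thumbnail and enclosure links with an image MIME type), then falls back to scanning every nested string for something that parses as an http(s) URL with an image extension. The names IMAGE_EXTENSIONS, looks_like_image_url, iter_strings and find_thumbnail are hypothetical, and the extension list is an assumption.

```python
from urllib.parse import urlparse

# Assumed list of extensions; extend as needed.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')


def looks_like_image_url(value):
    """Return True if value is an http(s) URL whose path ends in an image extension."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:  # raised for some malformed URLs
        return False
    return (parsed.scheme in ('http', 'https')
            and parsed.path.lower().endswith(IMAGE_EXTENSIONS))


def iter_strings(node):
    """Yield every string found anywhere in nested dicts, lists and tuples."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def find_thumbnail(entry):
    """Return the first plausible thumbnail URL for a feed entry, or None."""
    # 1. Media thumbnails, as in the BBC example.
    for thumb in entry.get('media_thumbnail', []):
        if thumb.get('url'):
            return thumb['url']
    # 2. Enclosure links with an image MIME type, as in the Le Monde example.
    for link in entry.get('links', []):
        is_image = (link.get('type') or '').startswith('image/')
        if link.get('rel') == 'enclosure' and is_image and link.get('href'):
            return link['href']
    # 3. Fallback: any nested string that looks like an image URL.
    return next((s for s in iter_strings(entry) if looks_like_image_url(s)), None)
```

Checking the path of the parsed URL rather than the raw string means that a query string after the extension does not hide the image, and that a summary mentioning ".jpg" in running text is ignored.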
PS: while it does not answer the exact question of finding a value in a nested dictionary, I found that a good way to get images for articles from RSS feeds consistently is simply to follow the URL back to the original article, parse the HTML and look for the og:image meta tag.
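For illustration, here is a minimal sketch of that og:image lookup using only the standard library. It assumes the page declares the image as a meta tag with property="og:image" and a content attribute; the names OGImageParser, extract_og_image and fetch_og_image, the User-Agent header and the timeout value are all hypothetical choices, not something the feeds or feedparser prescribe.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class OGImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only meta tags matter, and only the first og:image is kept.
        if tag != 'meta' or self.og_image is not None:
            return
        attributes = dict(attrs)
        if attributes.get('property') == 'og:image':
            self.og_image = attributes.get('content')


def extract_og_image(html):
    """Return the og:image URL found in an HTML string, or None."""
    parser = OGImageParser()
    parser.feed(html)
    return parser.og_image


def fetch_og_image(article_url, timeout=10):
    """Download the article page and return its og:image URL, or None."""
    # Some sites reject requests without a User-Agent (assumed header value).
    request = Request(article_url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        html = response.read().decode(charset, errors='replace')
    return extract_og_image(html)
```

Calling fetch_og_image(entry['link']) costs one extra HTTP request per article, so the result is probably worth storing rather than recomputing on every page view.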