Find a value by substring in a nested dictionary


To give my problem some context: I am writing a Django web app that includes several applications. One of them displays articles from RSS feeds. So far I have only been displaying the link, source and description, and I want to add thumbnails to these articles. I'm trying to grab these thumbnails for any RSS or Atom feed. Some parts of these feeds (e.g. images) are structured in completely arbitrary ways. Since I don't want to write a specific script for every feed on the Web, my idea is to look for ".jpg" or ".png" substrings in every article I fetch and take that URL. Turning RSS or Atom feeds into articles is handled well by the Python feedparser module, which outputs this, for example:

 {'guidislink': False,
  'href': '',
  'id': 'http://www.bbc.co.uk/sport/football/39426760',
  'link': 'http://www.bbc.co.uk/sport/football/39426760',
  'links': [{'href': 'http://www.bbc.co.uk/sport/football/39426760',
             'rel': 'alternate',
             'type': 'text/html'}],
  'media_thumbnail': [{'height': '576',
                       'url': 'http://c.files.bbci.co.uk/44A9/production/_95477571_joshking2.jpg',
                       'width': '1024'}],
  'published': 'Wed, 05 Apr 2017 21:49:14 GMT',
  'published_parsed': time.struct_time(tm_year=2017, tm_mon=4, tm_mday=5, tm_hour=21, tm_min=49, tm_sec=14, tm_wday=2, tm_yday=95, tm_isdst=0),
  'summary': 'Joshua King scores a dramatic late equaliser for Bournemouth as '
             'Liverpool drop two crucial points at Anfield.',
  'summary_detail': {'base': 'http://feeds.bbci.co.uk/news/rss.xml',
                     'language': None,
                     'type': 'text/html',
                     'value': 'Joshua King scores a dramatic late equaliser '
                              'for Bournemouth as Liverpool drop two crucial '
                              'points at Anfield.'},
  'title': 'Liverpool 2-2 Bournemouth',
  'title_detail': {'base': 'http://feeds.bbci.co.uk/news/rss.xml',
                   'language': None,
                   'type': 'text/plain',
                   'value': 'Liverpool 2-2 Bournemouth'}}

Here, http://c.files.bbci.co.uk/44A9/production/_95477571_joshking2.jpg is nested somewhere in lists and dictionaries. While I know how to access it in this specific case, the structure of feeds varies widely. Mainly:

  • The dictionary key holding the URL is not always the same
  • The depth at which the URL is nested is not always the same

However, what is almost always the case is that a URL with an image extension is the thumbnail of that article. How do I get that URL?

To frame it a little more: for now I use helper functions (based on the feedparser module) that process a feeds context variable, which is a dictionary usable in my templates. I do the looping and the display of title, description, etc. directly in my templates, since feedparser makes them a consistent part of that dictionary:

...
{% for feed in feeds %}
  <h3>{{ feed.feed.title }}</h3>
  {% for entry in feed.entries %}
...

On the backend:

import feedparser
from django.views import generic

# urls, RSSArticle and delete_existing_entries are defined elsewhere in the app (not shown here).

def parse_feeds(urls):
    # Fetch and parse every feed URL with feedparser.
    parsed_feeds = []
    for url in urls:
        parsed_feed = feedparser.parse(url)
        parsed_feeds.append(parsed_feed)
    return parsed_feeds

class IndexView(generic.ListView):
    template_name = 'publisher/index.html'

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        reacted_feeds = RSSArticle.objects.all()
        context['reacted_feeds'] = reacted_feeds
        # The feeds are re-fetched and re-parsed on every request.
        parsed_feeds = parse_feeds(urls)
        delete_existing_entries(parsed_feeds)
        context['feeds'] = parsed_feeds
        return context

So basically, every time that IndexView is called, you get the list of all articles from the feeds you subscribed to. That's where I want to include the images, which feedparser does not provide because of the inconsistent nature of their location in feeds.

If I want to include these pictures, at a macro level I basically have two solutions:

  • Writing something in addition to the existing system, but that might hurt performance because of too many things having to happen at the same time
  • Rewriting the whole thing, which might also hurt performance and consistency because I don't take advantage of Feedparser's power anymore

Maybe I should just keep the raw XML and try my luck with BeautifulSoup instead of translating it to a dictionary with feedparser.
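A minimal sketch of what that raw-XML idea might look like. To avoid a third-party dependency it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, which is a substitution made for illustration. The function name find_image_urls, the extension list and the sample RSS fragment are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed list of extensions; extend it as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def find_image_urls(item):
    """Yield every attribute value or element text under item that ends with an image extension."""
    for element in item.iter():
        candidates = list(element.attrib.values())
        if element.text:
            candidates.append(element.text.strip())
        for candidate in candidates:
            if candidate.lower().endswith(IMAGE_EXTENSIONS):
                yield candidate


# Hypothetical RSS fragment, not taken from a real feed.
sample_rss = """
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>Example article</title>
      <link>http://example.com/article</link>
      <media:thumbnail url="http://example.com/images/thumb.jpg" width="1024" height="576"/>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(sample_rss)
for item in root.iter("item"):
    print(list(find_image_urls(item)))
```

Because it only checks how a value ends, this sketch would miss image URLs followed by a query string; a substring test would catch those at the cost of more false positives.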

PS: here is another example, where the image is located somewhere else.

{'guidislink': False,
 'id': 'http://www.lemonde.fr/tiny/5106451/',
 'link': 'http://www.lemonde.fr/les-decodeurs/article/2017/04/05/presidentielle-les-grands-clivages-qui-divisent-les-onze-candidats_5106451_4355770.html?xtor=RSS-3208',
 'links': [{'href': 'http://www.lemonde.fr/les-decodeurs/article/2017/04/05/presidentielle-les-grands-clivages-qui-divisent-les-onze-candidats_5106451_4355770.html?xtor=RSS-3208',
            'rel': 'alternate',
            'type': 'text/html'},
           {'href': 'http://s1.lemde.fr/image/2017/04/05/644x322/5106578_3_0f2b_sur-le-plateau-du-debat-de-bfmtv-et-cnews_0e90a3db44861847870cfa1e4c3793b1.jpg',
            'length': '40057',
            'rel': 'enclosure',
            'type': 'image/jpeg'}],
 'published': 'Wed, 05 Apr 2017 17:02:38 +0200',
 'published_parsed': time.struct_time(tm_year=2017, tm_mon=4, tm_mday=5, tm_hour=15, tm_min=2, tm_sec=38, tm_wday=2, tm_yday=95, tm_isdst=0),
 'summary': 'Protection sociale, Europe, identité… Avec leurs programmes, les '
            'proximités idéologiques entre candidats bousculent de plus en '
            'plus le traditionnel axe «\xa0gauche-droite\xa0».',
 'summary_detail': {'base': 'http://www.lemonde.fr/rss/une.xml',
                    'language': None,
                    'type': 'text/html',
                    'value': 'Protection sociale, Europe, identité… Avec leurs '
                             'programmes, les proximités idéologiques entre '
                             'candidats bousculent de plus en plus le '
                             'traditionnel axe «\xa0gauche-droite\xa0».'},
 'title': 'Présidentielle\xa0: les grands clivages qui divisent les onze '
          'candidats',
 'title_detail': {'base': 'http://www.lemonde.fr/rss/une.xml',
                  'language': None,
                  'type': 'text/plain',
                  'value': 'Présidentielle\xa0: les grands clivages qui '
                           'divisent les onze candidats'}}

Solution

  • I wrote a solution based on this snippet.

    def get_image_url(substring, data):
        """Recursively yield every string nested in dicts/lists that contains substring."""
        if isinstance(data, str):
            if substring in data:
                yield data
        elif isinstance(data, dict):
            for value in data.values():
                yield from get_image_url(substring, value)
        elif isinstance(data, list):
            for item in data:
                yield from get_image_url(substring, item)
        # Anything else (booleans, None, struct_time, ...) is skipped,
        # so no try/except is needed.
    
    >>> list(get_image_url('.jpg', article_dict))
    ['https://static01.nyt.com/images/2017/04/09/us/10OBAMA-alt/10OBAMA-alt-moth.jpg']
    

    PS: while this does not answer the exact question of finding a value in a nested dictionary, I found that a good way to get images for articles from RSS feeds consistently is simply to follow the URL back to the original article, parse the HTML and search for the og:image tag.
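The og:image approach could be sketched roughly as follows, using only the standard library's html.parser. Fetching the article page is left out; the sketch assumes the HTML is already available as a string. The names OgImageParser and find_og_image, and the sample markup, are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:image tag is kept; later ones are ignored.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(html_text):
    """Return the og:image URL found in html_text, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html_text)
    return parser.image_url


# Hypothetical article markup, used only to exercise the function.
sample_html = """
<html><head>
<meta property="og:title" content="Example article">
<meta property="og:image" content="http://example.com/thumb.jpg">
</head><body></body></html>
"""
print(find_og_image(sample_html))
```

In the view, this would mean one extra HTTP request per article, so caching the result alongside the stored article is probably worth considering.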