python-3.x rss feedparser hashlib hash-function

Get new items from rss feed


I am using Python's feedparser to parse some RSS feeds every 2 hours. Unfortunately, the feeds do not contain etag or modified values, so every time I parse them I get the entire data again. I am thinking of creating a hash of the entries returned from feedparser.parse and storing it in the database, so that the next time I parse I can compare against the hash, see whether the feed has changed, and only then kick off parsing for each item in the feed. My questions:

  1. Is there another or better way to tell whether the RSS feed has been updated?
  2. How do I create the hash? Is it enough to just do the following?

    import hashlib 
    hash_object = hashlib.sha256(<FEEDPARSER_RESPONSE>)
    hex_dig = hash_object.hexdigest() 
    
  3. Should I then store hex_dig in the database?


Solution

  • Hashing the FEEDPARSER_RESPONSE is a reasonable approach, especially since the etag and modified values don't exist in your feed. You didn't provide a link to your RSS feed, so I'm using one from CNN in my answer.

    import hashlib
    import feedparser
    
    cnn_top_news = feedparser.parse('http://rss.cnn.com/rss/cnn_topstories.rss')
    
    # I'm using entries because, in testing, it gave me the same hash.
    news_updated = cnn_top_news.entries
    
    ###################################################################
    # During testing, all of these items worked for creating the hash,
    # so there are multiple options to choose from.
    #
    # cnn_top_news['entries']
    # titles = [entry.title for entry in cnn_top_news['entries']]
    # summaries = [entry.summary for entry in cnn_top_news['entries']]
    ###################################################################
    
    hash_object = hashlib.sha256(str(news_updated).encode('utf-8'))
    hex_dig = hash_object.hexdigest()
    
    print(hex_dig)
    # output
    # 371c5730c7f1407878a32a814bc72542b48a43e1f7670eae0627d2617289161b
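
To go from "has the feed changed?" to "which items are new?", one option is to hash each entry separately and compare against the fingerprints already stored. The following is only a minimal sketch under a few assumptions: the helper names (entry_fingerprint, feed_digest, new_entries) and the STABLE_FIELDS tuple are hypothetical, the choice of id, link, and title as stable fields is a guess that may not suit every feed, and the "database" is stood in for by a plain set. The functions only call .get() on each entry, which feedparser entries support because they are dict-like, so the demo uses plain dicts instead of a live feed.

```python
import hashlib
import json

# Hypothetical choice of fields; pick whatever your feed populates reliably.
STABLE_FIELDS = ("id", "link", "title")


def entry_fingerprint(entry):
    """Return a SHA-256 hex digest of the selected fields of one entry."""
    picked = {field: entry.get(field, "") for field in STABLE_FIELDS}
    # sort_keys gives a deterministic serialization to hash.
    payload = json.dumps(picked, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def feed_digest(entries):
    """Return one digest for the whole feed, independent of entry order."""
    joined = "\n".join(sorted(entry_fingerprint(e) for e in entries))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def new_entries(entries, seen_fingerprints):
    """Return the entries whose fingerprint is not in seen_fingerprints."""
    return [e for e in entries if entry_fingerprint(e) not in seen_fingerprints]


if __name__ == "__main__":
    # Stand-ins for two successive results of feedparser.parse(...).entries.
    first_run = [
        {"id": "a", "link": "http://example.com/a", "title": "Story A"},
    ]
    second_run = first_run + [
        {"id": "b", "link": "http://example.com/b", "title": "Story B"},
    ]

    # Stand-in for what would be stored in the database after the first run.
    stored_digest = feed_digest(first_run)
    seen = {entry_fingerprint(e) for e in first_run}

    # Cheap check first: only look at individual items if the digest moved.
    if feed_digest(second_run) != stored_digest:
        for entry in new_entries(second_run, seen):
            print("new item:", entry["title"])
```

Hashing a few chosen fields rather than str() of the whole entry list is a design choice: it ignores fields that may vary between fetches without the item itself changing, at the cost of missing edits to fields you did not include.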