Tags: python, pandas, rss, feedparser

Import RSS with FeedParser and Get Both Posts and General Information to Single Pandas DataFrame


I am a Python novice working on an exercise to practice importing data in Python. Eventually I want to analyze data from different podcasts (information on each podcast itself and on every episode) by putting the data into a coherent DataFrame and working on it with NLP.

So far I have managed to read a list of RSS feeds and get the information on every single episode of the RSS feed (a post).

But I am having trouble finding an integrated workflow in Python that gathers both of the following in one go:

  1. information on every single episode of the RSS feed (a post)
  2. general information about the RSS feed (like the title of the podcast)

Code: This is what I have so far:

import feedparser
import pandas as pd

rss_feeds = [
    'http://feeds.feedburner.com/TEDTalks_audio',
    'https://joelhooks.com/rss.xml',
    'https://www.sciencemag.org/rss/podcast.xml',
]
# number of feeds is reduced for testing

posts = []
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

Output: The DataFrame contains 652 non-null objects in each of the three columns (as intended), basically every post made in every podcast. The title column holds the title of the episode but not the title of the podcast (which in this example is 'TED Talks Daily').

   title                                              link                                               summary
0  3 questions to ask yourself about everything y...  https://www.ted.com/talks/stacey_abrams_3_ques...  How you respond to setbacks is what defines yo...
1  What your sleep patterns say about your relati...  https://www.ted.com/talks/tedx_shorts_what_you...  Wendy Troxel looks at the cultural expectation...
2  How we can actually pay people enough -- with ...  https://www.ted.com/talks/ted_business_how_we_...  Capitalism urgently needs an upgrade, says Pay...

I am struggling to find a way to include the title of each podcast in this DataFrame as well. I always get an error when selecting parts of the overall feed information, e.g. ['feed']['title'].

Thanks for every hint with this!

Source: I adapted what I have so far from this source: Get Feeds from FeedParser and Import to Pandas DataFrame


Solution

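  • The feed title can be accessed in this case with feed.feed.title, since feedparser keeps feed-level metadata under the feed attribute of the parsed result, next to entries: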

    # ...
    for url in rss_feeds:
        feed = feedparser.parse(url)
        for post in feed.entries:
            posts.append((feed.feed.title, post.title, post.link, post.summary))
    
    df = pd.DataFrame(posts, columns=['feed_title', 'title', 'link', 'summary'])
    df
    

    Output:

              feed_title            title             link          summary
    0    TED Talks Daily  3 ways compa...  https://www....  When we expe...
    1    TED Talks Daily  How we could...  https://www....  Concrete is ...
    2    TED Talks Daily  3 questions ...  https://www....  How you resp...
    3    TED Talks Daily  What your sl...  https://www....  Wendy Troxel...
    4    TED Talks Daily  How we can a...  https://www....  Capitalism u...
    ..               ...              ...              ...              ...
    649  Science Maga...  Science Podc...  https://traf...  Fear-enhance...
    650  Science Maga...  Science Podc...  https://traf...  Discussing t...
    651  Science Maga...  Science Podc...  https://traf...  Talking kids...
    652  Science Maga...  Science Podc...  https://traf...  The minimum ...
    653  Science Maga...  Science Podc...  https://traf...  The origin o...
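
A slightly more defensive variant of the loop above may help when a feed or an entry lacks one of the fields. This is only a sketch: the helper names rows_from_feed and load_feeds and the extra feed_url column are hypothetical and not part of the original answer, and it assumes the parsed result behaves like a dictionary (feedparser's result objects support .get()). Missing fields become None instead of raising an error. Whether a missing field caused the error in the question is not known.

```python
def rows_from_feed(parsed, url=None):
    """Flatten one parsed feed into a list of row dicts, one per entry.

    `parsed` is expected to look like the result of feedparser.parse():
    a mapping with a "feed" mapping and an "entries" list.
    (Hypothetical helper, not part of feedparser.)
    """
    feed_info = parsed.get("feed", {})
    rows = []
    for entry in parsed.get("entries", []):
        rows.append({
            "feed_url": url,                         # assumption: extra column for traceability
            "feed_title": feed_info.get("title"),   # podcast-level title
            "title": entry.get("title"),             # episode-level fields
            "link": entry.get("link"),
            "summary": entry.get("summary"),
        })
    return rows


def load_feeds(urls):
    """Parse every URL and return one DataFrame with feed and entry info."""
    # Imported here so rows_from_feed stays usable without these packages.
    import feedparser
    import pandas as pd

    rows = []
    for url in urls:
        rows.extend(rows_from_feed(feedparser.parse(url), url))
    return pd.DataFrame(
        rows, columns=["feed_url", "feed_title", "title", "link", "summary"]
    )
```

Calling load_feeds(rss_feeds) with the list from the question should then give one row per episode, with the podcast title repeated in feed_title for every episode of that podcast.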