python pandas dataframe google-news

News API - Getting output into Pandas DataFrame


I have successfully managed to call the News API and get the results into a DataFrame but only for Page 1.

def get_articles(keyword):

  all_articles = newsapi.get_everything(q=keyword, sources='abc-news-au, news-com-au',
                                      domains='http://www.abc.net.au/news, http://www.news.com.au',
                                      from_param='2018-12-28',
                                      to='2019-01-28',
                                      language='en',
                                      sort_by='popularity',
                                      page=1)

  all_articles = pd.DataFrame(all_articles)
  all_articles = pd.concat([all_articles.drop(['articles'], axis=1), all_articles['articles'].apply(pd.Series)], axis=1)

  return all_articles

[screenshot of the resulting DataFrame]

It gives me the DataFrame I want; however, when I try to loop through the following pages, I get stuck.

I have tried the following

empty_list = []

for i in range(1,4,1):
  all_articles = newsapi.get_everything(q=keyword, sources='abc-news-au, news-com-au',
                                        domains='http://www.abc.net.au/news, http://www.news.com.au',
                                        from_param='2018-12-28',
                                        to='2019-01-28',
                                        language='en',
                                        sort_by='popularity',
                                        page=i)
  empty_list.append(all_articles)

This returns all the articles, but as a list of dictionaries (one response dictionary per page).

[{'articles': [{'author': None,
    'content': 'Updated \r\nJanuary 14, 2019 14:33:00\r\nANZ customers have lost access to banking services at their local post offices after the bank failed to reach an agreement with Australia Post on their Bank@Post service.\r\nThe change, which came into effect last night, wil… [+5084 chars]',
    'description': 'ANZ customers can no longer utilise banking services at their local post offices after the bank failed to reach an agreement with Australia Post on their Bank@Post service.',
    'publishedAt': '2019-01-14T03:14:57Z',
    'source': {'id': 'abc-news-au', 'name': 'ABC News (AU)'},
    'title': "ANZ customers 'furious' as access to Bank@Post cancelled",
    'url': 'https://www.abc.net.au/news/2019-01-14/anz-customers-lose-banking-service-at-australia-post/10713156',
    'urlToImage': 'https://www.abc.net.au/news/image/10710052-16x9-700x394.jpg'},
   {'author': 'Stephen Letts',
    'content': "Posted \r\nJanuary 26, 2019 06:20:15\r\nIf you think AMP's glum market update of an additional $200 million worth of costs to fix its various scandals rules a line under the sordid and sorry mess, think again.\r\nKey points:\r\nRemediation costs for Australia's scand… [+5019 chars]",
    'description': "Australia's six big wealth managers currently have provisions for about $2.6 billion to fix the scandals that have emerged from the banking royal commission. That could be be woefully inadequate.",
    'publishedAt': '2019-01-25T19:20:15Z',
    'source': {'id': 'abc-news-au', 'name': 'ABC News (AU)'},
    'title': "Wealth managers' remediation costs set to soar",
    'url': 'https://www.abc.net.au/news/2019-01-26/wealth-manager-remediation-costs-set-to-soar/10749810',
    'urlToImage': 'https://www.abc.net.au/news/image/1147126-16x9-700x394.jpg'}]

Previously, it was just a dictionary [no list].

When I apply a transformation similar to the one above, I get the following DataFrame:

[screenshot of the resulting DataFrame]

Questions:

  1. Does anyone know a better way?
  2. If you were to work with the current dataframe, how would you get the dictionary out of each column and present it so it looks like the first dataframe?

Any help would be appreciated.

PS: if you want to replicate, you can copy my code - you will just need to get your own API key from: https://newsapi.org/docs/client-libraries/python


Solution

  • It looks like you want to pull the 'articles' list out of each page's response and extend a single list with it, rather than append the whole response dictionary:

    articles = []
    
    for i in range(1,4,1):
        articles_page = newsapi.get_everything(
                q=keyword,
                sources='abc-news-au, news-com-au',
                domains='http://www.abc.net.au/news, http://www.news.com.au',
                from_param='2018-12-28',
                to='2019-01-28',
                language='en',
                sort_by='popularity',
                page=i)
        articles.extend(articles_page['articles'])
    
    # outside of the loop, create the DataFrame from the flat list of articles
    df = pd.DataFrame(articles)
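
To see the difference without an API key, here is a minimal sketch that uses made-up stand-ins for the page responses. The `fake_pages` data and its titles are hypothetical and only mimic the shape shown in the question. It also covers the second question: if you already have the list of response dictionaries, a list comprehension pulls the articles out of it. `pd.json_normalize` (available in pandas 1.0 and later) is used here as an optional extra step to spread the nested 'source' dictionary into its own columns.

```python
import pandas as pd

# Hypothetical stand-ins for three pages returned by newsapi.get_everything();
# real responses carry more fields (author, content, url, ...).
fake_pages = [
    {"status": "ok", "totalResults": 3,
     "articles": [{"title": "Story A",
                   "source": {"id": "abc-news-au", "name": "ABC News (AU)"}}]},
    {"status": "ok", "totalResults": 3,
     "articles": [{"title": "Story B",
                   "source": {"id": "news-com-au", "name": "News.com.au"}}]},
    {"status": "ok", "totalResults": 3,
     "articles": [{"title": "Story C",
                   "source": {"id": "abc-news-au", "name": "ABC News (AU)"}}]},
]

# append() keeps each whole response dictionary: one list item per page.
appended = []
for page in fake_pages:
    appended.append(page)

# extend() adds the individual articles: one list item per article.
articles = []
for page in fake_pages:
    articles.extend(page["articles"])

# If you already have the appended list, the articles can still be recovered.
recovered = [article for page in appended for article in page["articles"]]

# One row per article; the nested 'source' dict becomes
# 'source.id' and 'source.name' columns.
df = pd.json_normalize(articles)
print(df)
```

Plain `pd.DataFrame(articles)` works just as well if you are happy to keep 'source' as a column of dictionaries.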