python, web-scraping, beautifulsoup, python-requests, google-colaboratory

Why am I not getting the output nor an error in web scraping?


I am doing a web scraping assignment on Google Colab with BeautifulSoup and requests. Here I am only scraping the headlines from Google News. Below is the code:

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.text)

The problem is that when I run the cell, it shows neither the output (a list of headlines) nor an error. Please help; this has been bugging me for 2 days.


Solution

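  • The a elements with class VDXfz appear to have no text of their own, so you probably need to display the text from the next span element. The function also has to return soup so that the loop outside it can use it. This could be done as follows: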

    import requests
    from bs4 import BeautifulSoup
    
    def beautiful_soup(url):
        '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
           INTO SOMETHING THAT IS EASY TO READ'''
    
        request = requests.get(url)
        soup = BeautifulSoup(request.text, "lxml")
        #print(soup.prettify())
        return soup
    
    soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
    
    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        print(headlines.find_next('span').text)
    

    This would give you output starting something like:

    I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
    Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
    National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
    On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
    Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
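
To see why printing headlines.text shows nothing while find_next('span') works, here is a minimal offline sketch. The HTML string is hypothetical (made up for illustration, not copied from Google News) and only assumes that each matching a element is empty and is followed by a span holding the headline. It uses the built-in "html.parser" so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an empty link with the class from the question,
# followed by a span that holds the headline text.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/one"></a>
  <h3><a href="./articles/one"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/two"></a>
  <h3><a href="./articles/two"><span>Second headline</span></a></h3>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.find_all("a", {"class": "VDXfz"})

# The matched links exist but contain no text, so printing .text shows blanks.
link_texts = [link.text for link in links]

# find_next() walks forward in document order to the first span after each link.
headlines = [link.find_next("span").text for link in links]

print(link_texts)   # ['', '']
print(headlines)    # ['First headline', 'Second headline']
```

If the real page uses a different class name or layout, find_all would return an empty list and the loop would print nothing at all, so checking len(links) first is a quick sanity test.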
    

    You could write the headlines to a CSV formatted file using the following approach:

    import requests
    from bs4 import BeautifulSoup
    import csv
    
    def beautiful_soup(url):
        '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
           INTO SOMETHING THAT IS EASY TO READ'''
    
        request = requests.get(url)
        soup = BeautifulSoup(request.text, "lxml")
        return soup
    
    soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
    
    with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['Headline'])
    
        for headlines in soup.find_all('a', {'class': 'VDXfz'}):
            headline = headlines.find_next('span').text
            print(headline)
            csv_output.writerow([headline])
    

    Currently this just produces a single column called Headline.
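
If more columns are wanted, one possible extension is to write the link's href next to each headline. The sketch below is only an illustration under assumptions: the HTML is a hypothetical stand-in for the page, extract_rows is a helper name invented here, and the CSV is written to an in-memory buffer rather than output.csv so that it runs without network or file access. It also skips any link that has no following span, since find_next returns None in that case:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/one"></a>
  <h3><a href="./articles/one"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/two"></a>
  <h3><a href="./articles/two"><span>Second headline</span></a></h3>
</article>
<a class="VDXfz" href="./articles/three"></a>
"""


def extract_rows(soup):
    """Return [headline, href] pairs for every link that is followed by a span."""
    rows = []
    for link in soup.find_all("a", {"class": "VDXfz"}):
        span = link.find_next("span")
        if span is None:
            # No span after this link, so there is no headline to record.
            continue
        rows.append([span.text, link.get("href", "")])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = extract_rows(soup)

# Write to an in-memory buffer; swap in open('output.csv', 'w', newline='',
# encoding='utf-8') to write a real file as in the answer above.
buffer = io.StringIO(newline="")
csv_output = csv.writer(buffer)
csv_output.writerow(["Headline", "Link"])
csv_output.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The hrefs in this sample are relative paths; whether the real page uses relative or absolute links is not something this sketch assumes either way.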