I am working on a web scraping assignment in Google Colab using BeautifulSoup and requests. For now I am only scraping the headlines from Google News. Here is the code:
import requests
from bs4 import BeautifulSoup
def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.text)
The problem is that when I run the cell, it shows neither the output (a list of headlines) nor an error. Please help; this has been bugging me for two days.
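You probably need to display the text from the next span element, since the matched a elements appear to have no text of their own. The function also needs to return soup so that the loop outside it has something to search. This could be done as follows: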
import requests
from bs4 import BeautifulSoup
def beautiful_soup(url):
    '''Send a request to the URL and return the parsed BeautifulSoup object.'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    #print(soup.prettify())
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

# The headline text is in the first span that follows each matching anchor
for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.find_next('span').text)
This would give you output starting something like:
I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
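To see why find_next('span') makes a difference, here is a minimal offline sketch. The HTML snippet is made up for illustration (it is an assumption about the page layout, not a copy of Google News), and it uses the built-in "html.parser" so it runs without lxml. The helper name extract_headlines is also hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each <a class="VDXfz"> anchor is empty, and the
# headline text sits in a <span> that comes later in the document.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><span>First headline</span></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><span>Second headline</span></h3>
</article>
"""


def extract_headlines(html):
    """Return the text of the first <span> following each a.VDXfz anchor."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for anchor in soup.find_all("a", {"class": "VDXfz"}):
        span = anchor.find_next("span")
        # Skip anchors that have no span after them instead of raising
        if span is not None:
            headlines.append(span.get_text(strip=True))
    return headlines


# The anchors themselves carry no text, so printing anchor.text shows nothing
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
anchor_texts = [a.text for a in sample_soup.find_all("a", {"class": "VDXfz"})]
print(anchor_texts)

# Following each anchor to its next <span> recovers the headline text
print(extract_headlines(SAMPLE_HTML))
```

With markup shaped like this, the first print shows only empty strings, which would match the symptom of seeing no headlines.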
You could write the headlines to a CSV-formatted file using the following approach:
import requests
from bs4 import BeautifulSoup
import csv
def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])

    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        headline = headlines.find_next('span').text
        print(headline)
        csv_output.writerow([headline])
Currently, this just produces a single column called Headline.
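If more columns are wanted, the same writer can take extra fields per row. The sketch below is only an illustration under assumptions: it reuses a made-up HTML snippet rather than the live page, assumes each anchor's href holds the article link, and writes to an in-memory buffer so it can be run anywhere. The function name headlines_to_csv and the Link column are hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the downloaded page
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><span>First headline</span></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><span>Second headline</span></h3>
</article>
"""


def headlines_to_csv(html, f_output):
    """Write one row per a.VDXfz anchor: headline text and the anchor's href."""
    soup = BeautifulSoup(html, "html.parser")
    csv_output = csv.writer(f_output)
    csv_output.writerow(["Headline", "Link"])
    for anchor in soup.find_all("a", {"class": "VDXfz"}):
        span = anchor.find_next("span")
        headline = span.get_text(strip=True) if span is not None else ""
        # Assumption: the link lives in the anchor's href attribute
        csv_output.writerow([headline, anchor.get("href", "")])


# An in-memory buffer stands in for open('output.csv', 'w', newline='', ...)
buffer = io.StringIO(newline="")
headlines_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass a file object opened with newline='' and encoding='utf-8', as in the code above.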