I'm trying to get article titles from Yahoo News and organize them in a JSON file. When I dump the data to a JSON file, the result is confusing to read. How would I go about organizing the data, either after the dump or from the beginning?
This is for a web scraping project where I have to get top news articles and their bodies and export them to a JSON file, which can then be sent to someone else's program. For now, I'm just working on getting the titles from the Yahoo Finance homepage.
import requests
import json
from bs4 import BeautifulSoup
#Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser') #creating instance of class to parse the page
#Getting article title
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")
#Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup2.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")
#Organizing data for export
data = {'title1': title[0].get_text(),
'title2': title[1].get_text(),
'title3': title[2].get_text(),
'title4': title[3].get_text(),
'title5': title[4].get_text()}
#Exporting the data to results.json
with open("results.json", "w") as write_file:
    json.dump(str(data), write_file)
This is what ends up being written to the JSON file (at the time of writing this post):
"{'title1': 'These US taxpayers face higher payments thanks to new law',
'title2': 'These 12 Stocks Are the Best Values in 2019, According to Pros
Who\u2019ve Outsmarted the Market', '\\ntitle3': 'The Best Move You Can
Make With Your Investments in 2019, According to 5 Market Professionals',
'title4': 'The auto industry said goodbye to a lot of cars in 2018',
'title5': '7 Stock Picks From Top-Rated Wall Street Analysts'}"
I would like the code to show each article title on a separate line and to remove the stray backslashes that appear in the middle.
I ran your code but did not get the same result. You define 'title3' as a constant key, yet your output shows '\ntitle3'; I did not get that '\n' in my case. As for the backslashes, they appear because the file is not opened with an explicit encoding such as 'utf8' and ensure_ascii is not set to False, so json.dump escapes non-ASCII characters (for example, ’ is written as \u2019). I would suggest two changes: use the 'lxml' parser instead of 'html.parser', and write the file with this snippet:
with open("results.json", "w", encoding='utf8') as write_file:
    json.dump(str(data), write_file, ensure_ascii=False)
This worked for me: the backslash escapes are gone and the ASCII issue is solved as well.
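To also get each title on its own line, one option is to pass the dictionary itself to json.dump instead of str(data), and to add indent. The sketch below is only an illustration under some assumptions: build_title_dict and write_titles are hypothetical helper names, the sample titles are copied from the output in the question, and the file is written to a temporary path rather than results.json. In the real script the list would come from the scraped elements, for example [t.get_text() for t in title[:5]].

```python
import json
import os
import tempfile


def build_title_dict(titles):
    """Map 'title1', 'title2', ... to the given title strings (hypothetical helper)."""
    # strip() removes leading/trailing whitespace such as stray newlines.
    return {f"title{i}": t.strip() for i, t in enumerate(titles, start=1)}


def write_titles(data, path):
    """Write the dict as a JSON object, one key per line (hypothetical helper)."""
    # Passing the dict (not str(data)) makes json write a real JSON object
    # rather than one long quoted string.
    with open(path, "w", encoding="utf8") as write_file:
        json.dump(data, write_file, indent=4, ensure_ascii=False)


# Sample titles taken from the question's output; stand-ins for scraped text.
sample_titles = [
    "These US taxpayers face higher payments thanks to new law",
    "These 12 Stocks Are the Best Values in 2019, According to Pros "
    "Who\u2019ve Outsmarted the Market",
]

# Assumed output location for this sketch only.
out_path = os.path.join(tempfile.gettempdir(), "results_sketch.json")
data = build_title_dict(sample_titles)
write_titles(data, out_path)

# Read the file back to check what was written.
with open(out_path, encoding="utf8") as f:
    text = f.read()
print(text)
```

Because the file now holds a JSON object, the receiving program can load it with json.load and get a dictionary back, which it could not do with the stringified version.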