I am trying to create a database of several articles for text mining purposes. I extract the body of each article via web scraping and then save the bodies to a CSV file. However, I couldn't manage to save all the body texts. The code that I came up with saves only the text of the last URL (article), whereas if I print what I am scraping (and what I am supposed to save), I get the body of all the articles.
I have included only some of the URLs from the list (which contains many more) to give you an idea:
import requests
from bs4 import BeautifulSoup
import csv
r=["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
"http://www.nytimes.com/2013/06/16/magazine/the-effort-to-stop-the- attack.html",
"http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html",
"http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html",
"http://www.nytimes.com/interactive/2016/09/09/us/document-Review-of-the-San-Bernardino-Terrorist-Shooting.html",
]
for url in r:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    text = soup.find_all(("p",{"class": "story-body-text story-content"}))
    print(text)
with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(text)
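On each pass through the loop, text is replaced by the current article, so only the last one ends up in the file. Try declaring a variable such as all_text = "" before the for loop and appending each article's text to it with all_text += text + "\n" at the end of the loop body (the \n starts a new line). Because find_all returns a list of tags rather than a string, text must first be turned into a string, for example by joining the get_text() of each tag. Then, in the last line, write all_text instead of text; passing it as a one-element list, writerow([all_text]), keeps csv from splitting the string into single characters.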
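
As a rough sketch of the same idea, here is a version that collects one row per article and writes them all at the end. It has not been run against the NYT pages: the scraping step is replaced by a placeholder, and the names save_bodies, fetch_body, fake_fetch and the file newdb30_demo.csv are made up for illustration.

```python
import csv


def save_bodies(urls, fetch_body, path):
    """Write one CSV row (url, body) per article and return the rows written."""
    rows = []
    for url in urls:
        # fetch_body is a hypothetical callable that returns the article
        # text as a single string (in the real script: requests + BeautifulSoup)
        body = fetch_body(url)
        # Accumulate instead of overwriting, so earlier articles are kept
        rows.append([url, body])
    # Open the file once, after the loop, and write every collected row
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(rows)
    return rows


def fake_fetch(url):
    # Stand-in for the scraping step so the sketch runs without network access
    return "body of " + url


if __name__ == "__main__":
    demo_urls = ["http://example.com/a", "http://example.com/b"]
    save_bodies(demo_urls, fake_fetch, "newdb30_demo.csv")
```

In the real script, fetch_body would wrap the requests.get and BeautifulSoup calls from the question and return the joined paragraph text. Keeping one row per article, rather than one long all_text string, is a design choice that makes it easier to tell the articles apart later. An alternative would be to open the file before the loop and call writerow inside it.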