Tags: html, python-3.x, csv, beautifulsoup, export-to-csv

Scraping and saving data from URLs to CSV using BeautifulSoup


Well, I am new to BeautifulSoup in Python. I have written code that scrapes HTML and saves all the data I need to a CSV file. The values from the ALL_NUMBERS file are substituted into a URL template, which produces a large number of URLs.

The code is below:

import requests
from bs4 import BeautifulSoup

# --- read names ---
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/63.0.3239.84 Safari/537.36',
           'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7'}
all_names = [] # TO KEEP ALL NAMES IN MEMORY

with open('ALL_NUMBERS.txt', 'r') as text_file:
    for line in text_file:
        line = line.strip()
        all_names.append(line)

url_template = 'https://www.investing.com/news/stock-market-news/learjet-the-private-plane-synonymous-with-the-jetset-nears-end-of-runway-{}'

all_urls = [] # TO KEEP ALL URLs IN MEMORY

with open("url_requests.txt", "w") as text_file:
    for name in all_names:
        url = url_template.format(name)
        print('url:', url)
        all_urls.append(url)
        text_file.write(url + "\n")

# --- read data ---

for name, url in zip(all_names, all_urls):
    # print('name:', name)
    # print('url:', url)
    r1 = requests.get(url, headers=headers)

page = r1.content
soup = BeautifulSoup(page, 'html5lib')
results = soup.find('div', class_='WYSIWYG articlePage')
para = results.findAll("p")
results_2 = soup.find('div', class_='contentSectionDetails')
para_2 = results_2.findAll("span")

with open('stock_market_news_' + name + '.csv', 'w') as text_file:
    text_file.write(str(para))
    text_file.write(str(para_2))

It works well, but only with one URL. I want to save para and para_2 from each URL in a single CSV file, that is, save the two values from each URL on one line:

Text                  Time
para from URL(1)      para_2 from URL(1)
para from URL(2)      para_2 from URL(2)
...                   ...

Unfortunately, I don't know the best way to do this for a large number of URLs in my case.


Solution

  • Your parsing code runs after the loop, so only the last response is processed. You could instead collect the parameters for every URL in a list inside the loop, and then save the whole result to one file:

    import csv
    
    # ...
    
    # --- read data ---
    
    params = []
    for name, url in zip(all_names, all_urls):
        r1 = requests.get(url, headers=headers)
        page = r1.content
        soup = BeautifulSoup(page, 'html5lib')
        results = soup.find('div', class_='WYSIWYG articlePage')
        para = '\n'.join(r.text for r in results.findAll("p"))
        results_2 = soup.find('div', class_='contentSectionDetails')
        para_2 = results_2.findAll("span")[0].text
        params.append([para, para_2])
    
    # one file for all URLs; newline='' avoids blank lines on Windows
    with open('stock_market_news.csv', 'w', newline='') as text_file:
        wr = csv.writer(text_file, quoting=csv.QUOTE_ALL)
        wr.writerow(['Text', 'Time'])   # header row
        wr.writerows(params)            # one row per URL
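
    Note that soup.find() returns None when the expected div is missing (a 404 page, a changed layout, a bad number in ALL_NUMBERS.txt), so results.findAll("p") would then raise an AttributeError. Below is a minimal defensive sketch of the same loop; the one-second pause between requests is just a polite assumption, not something the site requires:

    import time

    params = []
    for name, url in zip(all_names, all_urls):
        r1 = requests.get(url, headers=headers)
        soup = BeautifulSoup(r1.content, 'html5lib')
        results = soup.find('div', class_='WYSIWYG articlePage')
        results_2 = soup.find('div', class_='contentSectionDetails')

        # skip pages where the expected blocks are missing
        if results is None or results_2 is None:
            print('skipped:', url)
            continue

        para = '\n'.join(p.text for p in results.findAll("p"))
        para_2 = results_2.find("span").text
        params.append([para, para_2])

        time.sleep(1)  # assumed polite delay between requests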
    
    
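    With quoting=csv.QUOTE_ALL every field is wrapped in quotes, so commas and line breaks inside the article text won't break the column layout. As a quick sanity check you can read the file back with csv.reader (a sketch, assuming the file name used above):

    import csv

    with open('stock_market_news.csv', newline='') as f:
        for row in csv.reader(f):
            print(row)  # ['Text', 'Time'], then one [para, para_2] pair per URL
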

    Does this answer solve your problem?

    Have a nice day!