Tags: python, beautifulsoup, urllib

Python Web Scraping with Multiple URLs + merging the data


What I'm trying to do:

  • Take multiple URLs.
  • Take the h2 text from every URL.
  • Merge the h2 texts and then write them to a CSV file.

In the code below, I managed to take one URL and grab the h2 texts from it:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://example.com/ekonomi/20200108/"

# what I'm trying to do:
# urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']

uClient = uReq(page_url)

page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "b-plainlist__info"})

out_filename = "output.csv"

headers = "title\n"


f = open(out_filename, "w")
f.write(headers)

for container in containers:
    title = container.h2.get_text()

    f.write(title.replace(",", " ") + "\n")

f.close()  # Close the file

Solution

  • Provided your iteration through the containers is correct, this should work:

    You want to iterate through the URLs: fetch each one, grab its titles, and append them to a list. Then build a DataFrame from that list and write it to CSV with pandas:

    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen as uReq
    import pandas as pd
    
    
    urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']
    
    titles = []
    for page_url in urls:
        uClient = uReq(page_url)
    
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
    
        # finds each product from the store page
        containers = page_soup.findAll("div", {"class": "b-plainlist__info"})
    
        for container in containers:
            titles.append(container.h2.get_text())
    
    df = pd.DataFrame(titles, columns=['title'])
    df.to_csv("output.csv", index=False)
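    Since the pages follow a date pattern (`/ekonomi/YYYYMMDD/`), the URL list can also be generated instead of typed by hand, and wrapping each fetch in a try/except means one failed day won't abort the whole run. A minimal sketch, assuming the same base URL, date range, and `b-plainlist__info` selector as above:

    ```python
    from datetime import date, timedelta
    from urllib.error import URLError
    from urllib.request import urlopen

    from bs4 import BeautifulSoup


    def daily_urls(base, start, end):
        """Yield one URL per day, from `end` back to `start` (newest first)."""
        day = end
        while day >= start:
            yield f"{base}/ekonomi/{day:%Y%m%d}/"
            day -= timedelta(days=1)


    def extract_titles(html):
        """Return the h2 text of every product container in one page's HTML."""
        page = BeautifulSoup(html, "html.parser")
        return [div.h2.get_text(strip=True)
                for div in page.find_all("div", {"class": "b-plainlist__info"})
                if div.h2 is not None]


    def collect_titles(urls):
        """Fetch each URL in turn, skipping (and reporting) any that fails."""
        titles = []
        for url in urls:
            try:
                with urlopen(url) as resp:
                    titles.extend(extract_titles(resp.read()))
            except URLError as exc:
                print(f"skipping {url}: {exc}")
        return titles


    urls = list(daily_urls("https://example.com",
                           date(2020, 1, 11), date(2020, 1, 14)))
    ```

    Keeping the parsing in `extract_titles` also means it can be unit-tested against a saved HTML snippet without touching the network.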