Tags: python, pandas, web-scraping, beautifulsoup, export-to-csv

Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas


I got this code to almost work, despite much ignorance. Please help on the home run!

  • Problem 1: INPUT:

I have a long list of URLs (1000+) to read from, stored in a single column of a .csv file. I would prefer to read them from that file rather than paste them into the code, as below.

  • Problem 2: OUTPUT:

Each source page actually lists 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach shown here (it only saves 2).

  • Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.

At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

  • Growing investment in fabs
  • Miniaturization of electronic products
  • Increasing demand for IoT devices

Market challenges

  • Rapid technological changes in semiconductor industry
  • Volatility in semiconductor industry
  • Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0 | 1 | 2 | 3
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices

But instead I get:

0 1 2 3 4 5 6
http/.../Global-Induction-Hobs-30196623/ Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Miniaturization of electronic products Increasing demand for IoT devices

Solution
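
Before changing the approach, it may help to see why the question's code loses one item per URL and pushes each URL's items further to the right. The sketch below skips the scraping and uses items copied from the question; 'url_a' and 'url_b' are placeholder column names standing in for the real URLs. It repeats the same drop / concat / transpose steps as the original functions.

```python
import pandas as pd

# Sample items copied from the question; "url_a" / "url_b" are placeholders for real URLs.
drivers_a = [
    "Product innovations and new designs",
    "Increasing demand for convenient home appliances with changes in lifestyle patterns",
    "Growing adoption of energy-efficient appliances",
]
drivers_b = [
    "Growing investment in fabs",
    "Miniaturization of electronic products",
    "Increasing demand for IoT devices",
]

frames = []
for name, items in [("url_a", drivers_a), ("url_b", drivers_b)]:
    df = pd.DataFrame(items, columns=[name])
    # drop(0, axis=0) removes the row labelled 0, which is the first real driver
    frames.append(df.drop(0, axis=0))

# Each frame has a different column name, so concat stacks the rows
# and fills the non-matching column with NaN.
stacked = pd.concat(frames)

# After transposing, each URL's items sit two columns further right than the previous one.
wide = stacked.T
print(wide)
```

So the missing third item comes from drop(0), and the sideways shift comes from concatenating frames whose column names differ before transposing.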

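  • Store the scraped results in a list of dicts (one dict per URL and type) and create a data frame from it. Then expand each list of drivers / challenges into separate columns and concatenate those columns with the 'url' and 'type' columns to get the final data frame.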

    Example

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
    data = []
    
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        toc = soup.find("div", id="toc")
    
        def get_drivers():
            data.append({
                'url':url,
                'type':'driver',
                'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
            })
    
        get_drivers()
    
    
        def get_challenges():
            data.append({
                'url':url,
                'type':'challenges',
                'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
            })
    
        get_challenges()
    
        
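    df = pd.DataFrame(data)
    # Keep 'url' and 'type', then spread each list across its own columns
    result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
    print(result)  # or: result.to_csv(sep='|')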
    

    Output

    url | type | 0 | 1 | 2
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
    https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm
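
The example above still hard-codes the URLs and returns one combined frame. A minimal sketch of the remaining pieces from the question (reading the URLs from a single-column .csv, and writing drivers and challenges to separate files with the URL in column 0) might look like this. The helper names read_urls and to_wide are made up for illustration, the input file is assumed to have no header row, and an in-memory sample stands in for the real file and for the scraped records.

```python
import io
import pandas as pd

def read_urls(csv_source):
    """Read URLs from the first column of a header-less CSV (path or file-like object)."""
    column = pd.read_csv(csv_source, header=None)[0]
    return column.dropna().str.strip().tolist()

def to_wide(records, kind):
    """One row per URL of the given type: URL in column 0, then one column per item."""
    rows = [[r["url"], *r["list"]] for r in records if r["type"] == kind]
    return pd.DataFrame(rows)

# Stand-in for the real .csv file; with a file on disk, pass its path to read_urls instead.
sample_csv = io.StringIO(
    "https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/\n"
    "https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/\n"
)
urls = read_urls(sample_csv)

# Stand-in for the scraped `data` list, using the same keys as the example above.
records = [
    {"url": urls[0], "type": "driver",
     "list": ["Growing investment in fabs",
              "Miniaturization of electronic products",
              "Increasing demand for IoT devices"]},
    {"url": urls[0], "type": "challenges",
     "list": ["Rapid technological changes in semiconductor industry",
              "Volatility in semiconductor industry",
              "Impact of technology chasm"]},
]

drivers = to_wide(records, "driver")
challenges = to_wide(records, "challenges")

# Print the CSV text; pass a file name such as 'detail-dr.csv' as the first argument to write a file.
print(drivers.to_csv(header=False, index=False))
print(challenges.to_csv(header=False, index=False))
```

In the real script, the list returned by read_urls would replace the hard-coded urls list, and the data list filled inside the loop would be passed to to_wide once the loop has finished, so each file is written a single time rather than on every iteration.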