python pandas dataframe web-scraping href

Scraping href using bs only returns the first link


I'm trying to scrape a table using BeautifulSoup, and in one of the columns a cell can contain more than one link (href), as in the example below.

<td class="column-6">
    <a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> / 
    <a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> / 
    <a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>

I am using the code below to locate them; however, it only returns the first href and none of the others for rows that have more than one.

from time import sleep
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)
from bs4 import BeautifulSoup

# Scrape the smallcaps website for IPO Information and save into dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"

service = Service(r"C:\Development\chromedriver_win32\chromedriver.exe")  # raw string so the backslashes are not read as escapes
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service)

driver.get(smallcaps_URL)
sleep(3)
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")

ipo_headers = []
ipo_contents = []

for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))

for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True)
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)

# Prints complete scraped dataframe from SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)

# Next thing to do is scrape a few other websites for comparison and remove duplicates.

Current output

                     Company_name ASX_code Issue_price  Raise                     Focus                                        Information
0              Allup Silica (TBA)      APS       $0.20    $5m               Silica sand                           https://allupsilica.com/
1          Andean Mining (14 Feb)      ADM       $0.20    $6m       Mineral exploration  https://smallcaps.com.au/andean-mining-ipo-col...
2       Catalano Seafood (24 Feb)      CSF       $0.20    $6m                   Seafood                      https://www.catalanos.net.au/
3     Dragonfly Biosciences (TBA)      DRF       $0.20   $11m           Cannabidiol oil                  https://dragonflybiosciences.com/
4     Equity Story Group (18 Mar)      EQS       $0.20  $5.5m  Market advice & research                        https://equitystory.com.au/
5             Far East Gold (TBA)      FEG       $0.20   $12m       Mineral exploration  https://smallcaps.com.au/far-east-gold-asx-ipo...
6        Killi Resources (10 Feb)      KLI       $0.20    $6m           Gold and copper                          https://www.killi.com.au/
7           Lukin Resources (TBA)      LKN       $0.20  $7.5m       Mineral exploration  https://smallcaps.com.au/lukin-resources-launc...
8         Many Peaks Gold (2 Mar)      MPG       $0.20  $5.5m       Mineral exploration                          https://manypeaks.com.au/
9         Norfolk Metals (14 Mar)      NFL       $0.20  $5.5m          Gold and uranium                      https://norfolkmetals.com.au/
10    Omnia Metals Group (21 Feb)      OM1       $0.20  $5.5m       Mineral exploration                    https://www.omniametals.com.au/
11        Pure Resources (16 Mar)      PR1       $0.20  $4.6m       Mineral exploration                   http://www.pureresources.com.au/
12     Pinnacle Minerals (11 Mar)      PIM       $0.20  $5.5m        Kaolin - Haloysite                   https://pinnacleminerals.com.au/
13          Stelar Metals (7 Mar)      SLB       $0.20    $7m           Copper and zinc                       https://stelarmetals.com.au/
14        Top End Energy (21 Mar)      TEE       $0.20  $6.4m               Oil and gas                    http://www.topendenergy.com.au/
15  US Student Housing REIT (TBA)      USQ       $1.38   $45m  US student accommodation                              https://usq-reit.com/

Process finished with exit code 0

The expected output should have three links (hrefs) in the 'Information' column for some rows; however, it only returns the first link for each of them. Could someone please guide me in the right direction?

Solution

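  • The code below seems to work. find() returns only the first matching tag, whereas find_all() returns every <a> tag in the cell, so all hrefs are collected where more than one is available.

    for content in all_ipo_content:
        if content.a:
            # Collect the href of every link in the cell, not just the first one
            all_urls = [a.get("href") for a in content.find_all('a')]
            ipo_contents.append(all_urls)
        else:
            ipo_contents.append(content.text)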
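
  • As a self-contained check of the same idea, here is a minimal sketch that runs without Selenium. The multi-link cell is copied from the question; the surrounding table markup, the first cell's class name and the helper name cell_value are assumptions made up for illustration. It joins the links into a single string instead of appending a list, so every cell stays a plain value. That should keep the np.reshape(ipo_contents, (-1, 6)) step from the question working, since recent NumPy versions may reject a mix of strings and lists of different lengths.

```python
from bs4 import BeautifulSoup

# The column-6 cell is copied from the question; the wrapping table row and
# the first cell (including its class name) are assumed for illustration.
html = """
<table><tr>
<td class="column-1">Andean Mining (14 Feb)</td>
<td class="column-6">
    <a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
    <a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
    <a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_value(cell):
    """Hypothetical helper: all hrefs in a cell joined by ', ', or the cell text if it has no links."""
    links = [a["href"] for a in cell.find_all("a", href=True)]
    return ", ".join(links) if links else cell.get_text(strip=True)


# One plain string per cell, whether it holds zero, one or several links
ipo_contents = [cell_value(td) for td in soup.find_all("td")]

# For comparison: find() stops at the first <a>, which is the behaviour in the question
first_only = soup.find("td", class_="column-6").find("a", href=True)["href"]

for value in ipo_contents:
    print(value)
```

    If separate columns per link are preferred, the joined string can be split again later; a different separator works equally well as long as it cannot occur inside a URL you expect to see.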