
Retrieving content using BeautifulSoup and selectors


I am trying to retrieve text embedded in HTML with BeautifulSoup, but I am not getting the content.

I am trying to find price_box with a selector in this format:
price_box = soup2.find('div', attrs={'title class': 'Fw(600)'})

# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://finance.yahoo.com/quote/NVDA?p=NVDA'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

beta = soup.find('h1')
#print (beta)

link = beta.contents

variable = 'NVDA - NVIDIA Corporation'
test = 'NVDA - NVIDIA Corporation'

#<..>

url2 = 'https://finance.yahoo.com/calendar/earnings?from=2019-09-01&to=2019-09-07&day=2019-09-01'

response2 = requests.get(url2)
soup2 = BeautifulSoup(response2.text, "html.parser")
# alpha = soup2.find('')

# div = soup.find('a', {class_ ='D(ib) '})
# text = div.string

price_box = soup2.find('div', attrs={'title class': 'Fw(600)'})
#price = price_box.text
print("Price Box: "+ str(price_box)) # THIS IS WHAT I WANT

I was hoping to see "Senea". Instead I get None; the output is "Price Box: None".


Solution

  • A lot of the content on that page is dynamic. You can pull the information out of the response text with a regex:

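    import re
    import requests
    
    # Lazily capture whatever follows "YFINANCE: up to the next double quote
    p = re.compile(r'"YFINANCE:(.*?)"')
    r = requests.get('https://finance.yahoo.com/calendar/earnings?from=2019-09-01&to=2019-09-07&day=2019-09-01&guccounter=1')
    # findall returns a list; [0] raises IndexError if the pattern is absent
    print(p.findall(r.text)[0])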
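
To see what that pattern captures without hitting the network, here is a minimal sketch run against a made-up string. The `sample` text and the symbols in it are assumptions for illustration, not the real page source; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: lazily capture everything after "YFINANCE: up to the next quote
pattern = re.compile(r'"YFINANCE:(.*?)"')

# Hypothetical stand-in for the page source; the real markup may differ
sample = '{"a":"YFINANCE:AAA,BBB","b":"other","c":"YFINANCE:CCC"}'

# findall returns the captured group for every non-overlapping match
matches = pattern.findall(sample)

# Guard against an empty result instead of indexing [0] blindly
first = matches[0] if matches else None
print(first)
```

Checking for an empty list first is probably worth it here, since the embedded data can change or disappear without notice.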
    

    An alternative is to avoid the dynamic-looking classes altogether and select by id:

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://finance.yahoo.com/calendar/earnings?from=2019-09-01&to=2019-09-07&day=2019-09-01&guccounter=1')
    # The 'lxml' parser needs the lxml package installed
    soup = bs(r.content, 'lxml')
    # First <a> inside the element with id="cal-res-table"
    print(soup.select_one('#cal-res-table a').text)
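
The id-based selector can be tried offline on a small snippet. This is only a sketch: the `html` string below is invented and much simpler than the real page, and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is much larger and may differ
html = """
<div id="cal-res-table">
  <table>
    <tr><td><a href="/quote/AAA">AAA</a></td><td>Example Corp</td></tr>
    <tr><td><a href="/quote/BBB">BBB</a></td><td>Sample Inc</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first <a> nested under the element with id="cal-res-table"
first_link = soup.select_one("#cal-res-table a")

# select_one returns None when nothing matches, so check before reading .text
symbol = first_link.text if first_link is not None else None
print(symbol)

# select returns every match as a list
all_symbols = [a.text for a in soup.select("#cal-res-table a")]
print(all_symbols)
```

An id tends to be more stable than generated class names, which is the reason for preferring it here.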
    

    Reading:

    1. CSS selectors
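
As for the original attempt: `attrs={'title class': 'Fw(600)'}` asks for a single attribute literally named "title class", which no element has, so `find` returns None regardless of what is on the page. A small sketch of how class matching behaves, using an invented one-line `html` string (an assumption, not the real markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a utility-style class name such as Fw(600)
html = '<div class="Fw(600) C(blue)" title="demo">Example Corp</div>'
soup = BeautifulSoup(html, "html.parser")

# The question's lookup: there is no attribute named "title class", so this is None
miss = soup.find("div", attrs={"title class": "Fw(600)"})

# class_ is compared against each individual class of a multi-class element
hit = soup.find("div", class_="Fw(600)")

# CSS equivalent: ~= matches one whitespace-separated token, no escaping of ( ) needed
css_hit = soup.select_one('div[class~="Fw(600)"]')

print(miss)
print(hit.text)
print(css_hit.text)
```

Even with a correct class lookup, the element may still be missing from the HTML that requests receives if the page builds it dynamically, which is why the approaches above avoid those classes.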