python, beautifulsoup, request

Retrieving text from <span> tag element issue


I am trying to scrape some details from a webpage. Specifically, I am trying to retrieve a price. Here is the relevant HTML

Here is my code

import requests
from bs4 import BeautifulSoup

url = 'https://glomark.lk/coconut/p/11624'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

price_div = soup.find('div', class_='price')

if price_div:
    price_span = price_div.find('span')
    if price_span:
        print(price_span)
    else:
        print("no span class") 
else:
    print("no div class")

This code returns the span with no text inside it.

Code results: <span id="product-promotion-price"></span>

Should return: <span id="product-promotion-price">Rs 105.00</span>

I have tried this with a different webpage and a user agent. There I get the text as 0.00.

import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Mozilla/5.0'}

url = 'https://scrape-sm1.github.io/site1/FLORA%20FACIAL%20TISSUES%202%20X%20160%20BOX%20-%20HOUSEHOLD%20-%20Categories%20market1super.com.html'
html = requests.get(url, headers=user_agent)
soup = BeautifulSoup(html.text, 'html.parser')

price_span = soup.find('span', class_='price')

if price_span:
    print(price_span)
else:
    print("no span class")

Code results: <span class="price">Rs.0.00</span>

Should return: <span class="price">Rs.95.00</span>

Does anyone know why this happens? I would appreciate any insights. Thanks.


Solution

  • When the page is first sent, the price field is empty (or 0). Once the JavaScript loads and runs, it retrieves the price and updates that field. I verified this by opening DevTools, loading the page with the Network tab open, and reading the HTML that was originally sent. You are only downloading the raw HTML, not rendering it or executing any of its scripts as a browser would. To find where the price comes from, open the DevTools Network tab, load the page, and filter to "Fetch/XHR". You will see that the price comes from this URL: https://glomark.lk/product-page/variation-detail/11624 (note that it must be a POST request; otherwise it returns a 404).

    Here is an example of how to get all the data that the variation-detail endpoint returns (the price is included in the response):

    import requests

    url = "https://glomark.lk/product-page/variation-detail/11624"
    # Headers that mimic the browser's XHR request; the referer is the product page.
    headers = {
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://glomark.lk/coconut/p/11624",
    }

    # The endpoint only answers POST requests (a GET returns a 404).
    response = requests.post(url, headers=headers)
    print(response.text)
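
Once the response text is in hand, the price still has to be located inside it. The exact shape of the payload is not shown above, so the following is only a rough sketch under assumptions: it assumes the endpoint returns JSON (the accept header suggests this, but it is not confirmed here) and that the relevant keys contain the word "price". The helper `find_price_fields` and the sample payload are hypothetical and exist only for illustration; the real response may be structured quite differently.

```python
import json


def find_price_fields(data, needle="price", path=""):
    """Recursively collect (path, value) pairs whose key contains `needle`."""
    matches = []
    if isinstance(data, dict):
        for key, value in data.items():
            child_path = f"{path}.{key}" if path else str(key)
            if needle in str(key).lower():
                matches.append((child_path, value))
            # Keep descending so nested price-like keys are found as well.
            matches.extend(find_price_fields(value, needle, child_path))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            matches.extend(find_price_fields(item, needle, f"{path}[{index}]"))
    return matches


# Hypothetical payload for illustration only; NOT the real endpoint response.
sample_text = (
    '{"status": true, "data": {"name": "Coconut", "price": 105.0, '
    '"variants": [{"promo_price": 99.5}]}}'
)
sample = json.loads(sample_text)

for field_path, value in find_price_fields(sample):
    print(field_path, "=", value)
```

With a live response, the same helper could be called as `find_price_fields(response.json())`, provided the body really is valid JSON. Printing every matching path first makes it easier to pick the right key before hard-coding it.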