I want to scrape Product Information from Div below, but when I prettify the HTML I am not able to find the main DIV in HTML.
<div class="c2p6A5" data-qa-locator="product-item" data-tracking="product-card"
The elements I am trying to fetch is in the following script. I need to know how can I extract data from Script below:
<script type="application/ld+json"></script>
My code is as follows:
import requests
from bs4 import BeautifulSoup
url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
just use .find()
or find_all()
when I do that, I see it's actually in json format, so then can just read that element and have all the data stored that way.
import requests
from bs4 import BeautifulSoup
import json
import re
url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
alpha = soup.find_all('script',{'type':'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
for item in jsonObj['itemListElement']:
name = item['name']
price = item['offers']['price']
currency = item['offers']['priceCurrency']
availability = item['offers']['availability'].split('/')[-1]
availability = [s for s in re.split("([A-Z][^A-Z]*)", availability) if s]
availability = ' '.join(availability)
url = item['url']
print('Availability: %s Price: %0.2f %s Name: %s' %(availability,float(price), currency,name))
Output:
Availability: In Stock Price: 82199.00 Rs. Name: DELL INSPIRON 15 5570 - 15.6"HD - CI5 - 8THGEN - 4GB - 1TB HDD - AMD RADEON 530 2GB GDDR5.
Availability: In Stock Price: 94599.00 Rs. Name: DELL INSPIRON 15 3576 - 15.6"HD - CI7 - 8THGEN - 4GB - 1TB HRD - AMD Radeon 520 with 2GB GDDR5.
Availability: In Stock Price: 106399.00 Rs. Name: DELL INSPIRON 15 5570 - 15.6"HD - CI7 - 8THGEN - 8GB - 2TB HRD - AMD RADEON 530 2GB GDDR5.
Availability: In Stock Price: 17000.00 Rs. Name: Dell Latitude E6420 14-inch Notebook 2.50 GHz Intel Core i5 4GB 320GB Laptop
Availability: In Stock Price: 20999.00 Rs. Name: Dell Core i5 6410 8GB Ram Wi-Fi Windows 10 Installed ( Refurb )
Availability: In Stock Price: 18500.00 Rs. Name: Core i-5 Laptop Dell 4GB Ram 15.6 " Display Windows 10 DVD+Rw ( Refurb )
Availability: In Stock Price: 8500.00 Rs. Name: Laptop Dell D620 Core 2 Duo 80_2Gb (Used)
...
EDIT: To see the difference in the 2 json structures:
jsonObj_0 = json.loads(alpha[0].text)
jsonObj_1 = json.loads(alpha[1].text)
print(json.dumps(jsonObj_0, indent=4, sort_keys=True))
print(json.dumps(jsonObj_1, indent=4, sort_keys=True))