I am scraping nykaa.com, specifically this category page: https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=1. There are 25 pages and the data loads dynamically on each page, so I am unable to find the source of the data. Moreover, when I scrape the data I only get about 20 products, which keep repeating until the list grows to 420 products.
import requests
from bs4 import BeautifulSoup
import unicodecsv as csv

urls = []
l1 = []
for page in range(1, 5):
    result = requests.get("https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=" + str(page))
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    for div_tag in soup.find_all("div", class_="card-wrapper-container col-xs-12 col-sm-6 col-md-4"):
        for div1_tag in soup.find_all("div", class_="product-list-box card desktop-cart"):
            h2_tag = div1_tag.find("h2").find("span")
            price_tag = div1_tag.find("div", class_="price-info")
            l1 = [h2_tag.get_text(), price_tag.get_text()]
            urls.append(l1)
#print(urls)
with open('xyz.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(urls)
The above code fetches a list of around 1200 product names and prices, of which only 30 to 40 are unique; the rest are duplicates. I want to fetch the data from all 25 pages without duplicates; there are 486 unique products in total. I also tried Selenium to click the next-page link, but that didn't work out either.
This makes the same request the page itself makes (as seen in the browser's network tab) in a loop over all pages, including determining the number of pages from the first response. results is a list of lists you can easily write to CSV.
import requests, math, csv

page = '1'

def append_new_rows(data):
    # keep only entries that look like products and grab name + final price
    for i in data:
        if 'name' in i:
            results.append([i['name'], i['final_price']])

with requests.Session() as s:
    # first request: used both for page 1 data and to work out how many pages there are
    r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
    results_per_page = 20
    total_results = r['response']['total_found']
    num_pages = math.ceil(total_results / results_per_page)
    results = []
    append_new_rows(r['response']['products'])

    # remaining pages
    for page in range(2, num_pages + 1):
        r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
        append_new_rows(r['response']['products'])

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Name', 'Price'])
    for row in results:
        w.writerow(row)
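Since the question mentions duplicate rows, you could also add a small dedup pass before writing. This is just a sketch and it assumes a (name, price) pair identifies a product; if the API response carries a product id field, deduping on that would be safer.
seen = set()
unique_results = []
for row in results:
    key = tuple(row)  # assumption: (name, price) is unique per product
    if key not in seen:
        seen.add(key)
        unique_results.append(row)
# then write unique_results instead of results in the CSV block above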