I'm trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with BeautifulSoup4, but the parsed text (dumped to soup.html) does not contain the same components (tags or text in general) that I see in the browser's dev tools view when I search for matching strings with CTRL+F.
As far as I can tell there is nothing wrong with my code, but regardless of that, here it is:
#!/usr/bin/python
# imports
import requests
from bs4 import BeautifulSoup as soup
# parse website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")
# write parsed soup to file ("w" overwrites an earlier dump instead of appending to it)
with open("soup.html", "w", encoding="utf-8") as dumpfile:
    dumpfile.write(str(page_soup))
When I drag the soup.html file into the browser, all content loads as it should (like the real URL). I assume this is some kind of protection against parsing? I tried adding a request header that tells the web server on the other side that I am requesting the page from a real browser, but it didn't work either.
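For reference, passing such a header with requests looks roughly like the sketch below. The header values are illustrative placeholders (not necessarily what a real browser sends, and not necessarily what I originally tried), and `build_request` and `fetch_html` are hypothetical helper names. The request is only prepared here, so the headers can be inspected without contacting the server.

```python
import requests

URL = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"

# Placeholder browser-like headers; the exact values are an assumption.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Prepare a GET request without sending it."""
    return requests.Request("GET", url, headers=headers).prepare()


def fetch_html(url):
    """Send the prepared request and return the response body as text."""
    with requests.Session() as session:
        response = session.send(build_request(url), timeout=10)
        response.raise_for_status()
        return response.text


# Inspect what would be sent; no network traffic happens here.
prepared = build_request(URL)
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `fetch_html(URL)` would then perform the actual request with these headers.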
When I search for the wanted content in the browser, it (obviously) shows up...
Here is the parsed HTML saved as "soup.html". The content I am looking for cannot be found, regardless of how I search: with CTRL+F, or with the bs4 functions find_all() or find().
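The same check can be scripted instead of done by hand. This is a minimal sketch, assuming the dump has already been read into a string; `report_missing` is a hypothetical helper and the inline `sample_html` only stands in for the contents of soup.html.

```python
def report_missing(html, needles):
    """Return the search strings that do not occur in html (case-insensitive)."""
    haystack = html.lower()
    return [needle for needle in needles if needle.lower() not in haystack]


# Inline stand-in for the dumped file, so the sketch runs on its own.
sample_html = '<div data-test="productCard"><h3>Dashiel Brahmann</h3></div>'

missing = report_missing(sample_html, ["Dashiel Brahmann", "CHF 219"])
print(missing)  # ['CHF 219']
```

With the real dump, the string passed in would come from `open("soup.html", encoding="utf-8").read()`. Any string reported as missing is absent from the downloaded markup itself, not just hard to locate with find() or find_all().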
Based on your comment, here is an example of how you could extract some information from products that are on discount:
import requests
from bs4 import BeautifulSoup
url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):
    link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
    brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
    desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
    init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
    price = product.select_one('[data-test="price"]').get_text(strip=True)
    images = [i['content'] for i in product.select('meta[itemprop="image"]')]
    print('Link :', link)
    print('Brand :', brand)
    print('Description :', desc)
    print('Initial price :', init_price)
    print('Price :', price)
    print('Images :', images)
    print('-' * 80)
Prints:
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : printed button up shirt
Initial price : CHF 438
Price : CHF 219
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
--------------------------------------------------------------------------------
Link : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
Brand : Dashiel Brahmann
Description : corduroy T-Shirt
Initial price : CHF 259
Price : CHF 156
Images : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
--------------------------------------------------------------------------------
... and so on.
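If a card lacks one of these elements, `select_one()` returns `None` and the chained `.get_text()` call raises an `AttributeError`. Here is a more defensive sketch of the same selectors, run against a small inline snippet so it works offline. The markup in `SAMPLE_HTML` is a made-up stand-in that only mirrors the `data-test` attributes used above, and `text_of` and `discounted_products` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.farfetch.com"

# Made-up markup: only the attribute names are taken from the example above.
SAMPLE_HTML = """
<ul>
  <li data-test="productCard">
    <a itemprop="itemListElement" href="/ch/shopping/men/example-shirt-item-1.aspx">
      <h3 data-test="productDesignerName">Example Brand</h3>
      <p data-test="productDescription">printed shirt</p>
      <span data-test="initialPrice">CHF 438</span>
      <span data-test="discountPercentage">-50%</span>
      <span data-test="price">CHF 219</span>
    </a>
  </li>
  <li data-test="productCard">
    <a itemprop="itemListElement" href="/ch/shopping/men/example-coat-item-2.aspx">
      <h3 data-test="productDesignerName">Other Brand</h3>
      <span data-test="price">CHF 900</span>
    </a>
  </li>
  <li data-test="productCard">
    <a itemprop="itemListElement" href="/ch/shopping/men/example-hat-item-3.aspx">
      <h3 data-test="productDesignerName">Third Brand</h3>
      <span data-test="discountPercentage">-20%</span>
      <span data-test="price">CHF 80</span>
    </a>
  </li>
</ul>
"""


def text_of(card, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


def discounted_products(html):
    """Collect one dict per product card that contains a discount element."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])')
    products = []
    for card in cards:
        anchor = card.select_one('a[itemprop="itemListElement"][href]')
        products.append({
            "link": BASE_URL + anchor["href"] if anchor else None,
            "brand": text_of(card, '[data-test="productDesignerName"]'),
            "description": text_of(card, '[data-test="productDescription"]'),
            "initial_price": text_of(card, '[data-test="initialPrice"]'),
            "price": text_of(card, '[data-test="price"]'),
        })
    return products


products = discounted_products(SAMPLE_HTML)
for product in products:
    print(product)
```

The second card has no discount element and is filtered out by `:has()`. The third is kept, with `None` for the fields it lacks.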