Tags: python-3.x, web-scraping, web-applications, beautifulsoup, html-parsing

bs4 parses different html than browser


I'm trying to scrape farfetch.com (https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282) with BeautifulSoup4, but the parsed HTML (dumped to soup.html) does not contain the same components (tags or text in general) that I see in the browser's dev tools view when I search for matching strings with CTRL+F.

There is nothing wrong with my code, but regardless of that, here it is:

#!/usr/bin/python
# imports
import requests
from bs4 import BeautifulSoup as soup

# download and parse the website
url = 'https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282'
response = requests.get(url)
page_html = response.text
page_soup = soup(page_html, "html.parser")

# write parsed soup to file ("w" overwrites; "a" would append another copy on every run)
with open("soup.html", "w", encoding="utf-8") as dumpfile:
    dumpfile.write(str(page_soup))

When I drag the soup.html file into the browser, all content loads as it should (like the real URL). Is this some kind of protection against parsing? I tried sending a request header that tells the web server I am requesting the page from a real browser, but that didn't work either.

  1. Has anyone encountered something similar before?
  2. Is there a way to get the REAL html as shown in the browser?

When I search the wanted content in the browser it (obviously) shows up...


Here is the parsed HTML saved as "soup.html". The content I am looking for cannot be found, regardless of whether I search with CTRL+F or with the bs4 functions find_all() or find().

The parsed content is not the same as the content displayed in the browser.
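
One way to narrow this down is to count how often a known marker occurs in the raw response, before any parsing. The sketch below is only illustrative: the User-Agent string and the helper names `fetch_raw_html` and `count_marker` are assumptions, not part of the original code, and the marker is just an attribute that appears on product cards in the page source. If the count is zero, the content is most likely added later by JavaScript. If it is greater than zero, the content is in the response and the problem lies in how it is being searched.

```python
import requests

# Assumption: a generic browser-like User-Agent; any current browser string should do.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}


def fetch_raw_html(url):
    """Download the page as the server sends it, before any JavaScript runs."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text


def count_marker(html, marker):
    """Count how often `marker` occurs in the raw HTML string."""
    return html.count(marker)


if __name__ == "__main__":
    url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
    html = fetch_raw_html(url)
    print(count_marker(html, 'data-test="productCard"'))
```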


Solution

  • Based on your comment, here is an example of how you could extract some information about products that are on discount:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.farfetch.com/ch/shopping/men/sale/all/items.aspx?page=1&view=180&scale=282"
    
    # download the page and parse the raw HTML
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    
    # select only product cards that contain a discount percentage
    for product in soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])'):
    
        # the href is relative, so prepend the site root
        link = 'https://www.farfetch.com' + product.select_one('a[itemprop="itemListElement"][href]')['href']
        brand = product.select_one('[data-test="productDesignerName"]').get_text(strip=True)
        desc = product.select_one('[data-test="productDescription"]').get_text(strip=True)
        init_price = product.select_one('[data-test="initialPrice"]').get_text(strip=True)
        price = product.select_one('[data-test="price"]').get_text(strip=True)
        # image URLs are stored in the content attribute of <meta> tags
        images = [i['content'] for i in product.select('meta[itemprop="image"]')]
    
        print('Link          :', link)
        print('Brand         :', brand)
        print('Description   :', desc)
        print('Initial price :', init_price)
        print('Price         :', price)
        print('Images        :', images)
        print('-' * 80)
    

    Prints:

    Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-printed-button-up-shirt-item-14100332.aspx?storeid=9359
    Brand         : Dashiel Brahmann
    Description   : printed button up shirt
    Initial price : CHF 438
    Price         : CHF 219
    Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273147_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/32/14100332_22273157_300.jpg']
    --------------------------------------------------------------------------------
    Link          : https://www.farfetch.com/ch/shopping/men/dashiel-brahmann-corduroy-t-shirt-item-14100309.aspx?storeid=9359
    Brand         : Dashiel Brahmann
    Description   : corduroy T-Shirt
    Initial price : CHF 259
    Price         : CHF 156
    Images        : ['https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985600_300.jpg', 'https://cdn-images.farfetch-contents.com/14/10/03/09/14100309_21985606_300.jpg']
    --------------------------------------------------------------------------------
    
    ... and so on.
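
The loop above raises an error as soon as one card lacks one of the expected tags, because `select_one()` returns `None` when nothing matches. A more defensive variant could look like the sketch below. It is only an illustration: `SAMPLE_HTML` is hypothetical, heavily simplified markup that mimics the `data-test` attributes used above (the real page is much larger and may change), and `text_or_none` and `discounted_products` are made-up helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; it only mimics the attributes used above.
SAMPLE_HTML = """
<ul>
  <li data-test="productCard">
    <a itemprop="itemListElement" href="/ch/shopping/men/item-1.aspx">
      <h3 data-test="productDesignerName">Brand A</h3>
      <p data-test="productDescription">printed shirt</p>
      <span data-test="initialPrice">CHF 438</span>
      <span data-test="discountPercentage">-50%</span>
      <span data-test="price">CHF 219</span>
    </a>
  </li>
  <li data-test="productCard">
    <a itemprop="itemListElement" href="/ch/shopping/men/item-2.aspx">
      <h3 data-test="productDesignerName">Brand B</h3>
      <span data-test="price">CHF 100</span>
    </a>
  </li>
</ul>
"""


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag else None


def discounted_products(html):
    """Collect one dict per product card that contains a discount percentage."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select('[data-test="productCard"]:has([data-test="discountPercentage"])')
    products = []
    for card in cards:
        link = card.select_one('a[itemprop="itemListElement"][href]')
        products.append({
            "link": "https://www.farfetch.com" + link["href"] if link else None,
            "brand": text_or_none(card, '[data-test="productDesignerName"]'),
            "description": text_or_none(card, '[data-test="productDescription"]'),
            "initial_price": text_or_none(card, '[data-test="initialPrice"]'),
            "price": text_or_none(card, '[data-test="price"]'),
        })
    return products


if __name__ == "__main__":
    for product in discounted_products(SAMPLE_HTML):
        print(product)
```

Only the first sample card carries a discount percentage, so the second one is skipped by the `:has()` selector. To run this against the live page, pass `requests.get(url).text` instead of `SAMPLE_HTML`.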