python json beautifulsoup http-status-code-403

403 error on some URLs when extracting data from pages of the same site


Hello, can you help me? When I try to extract the JSON embedded in a web page, it works for some URLs on the same site, but for others I get a 403 error. The URLs are:

ok: https://www.falabella.com/falabella-cl/category/cat16510006/Electrohogar?facetSelected=true&f.derived.variant.sellerId=FALABELLA%3A%3ASODIMAC%3A%3ATOTTUS&page=1

error 403: https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1

My sample code:

import requests
import json
from bs4 import BeautifulSoup


session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})


def extract_json_from_falabella(url):
    try:
        response = session.get(url)
        response.raise_for_status()  # Raises an exception on a 4xx/5xx response

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Next.js pages embed their data as JSON in this script tag
            script_tag = soup.find('script', id='__NEXT_DATA__')

            if script_tag:
                json_text = script_tag.string.strip()
                data = json.loads(json_text)
                return data
            else:
                print("Script with id='__NEXT_DATA__' was not found.")
                return None
        else:
            print(f"Request failed: {response.status_code}")
            return None

    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error: {http_err}")
        return None
    except Exception as err:
        print(f"An error occurred: {err}")
        return None


url = "https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA%3A%3ASODIMAC&page=1"
data = extract_json_from_falabella(url)

if data:
    with open('falabella_data.json', 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, ensure_ascii=False, indent=4)
    print("Data saved to 'falabella_data.json'")
else:
    print("Could not extract the JSON data.")

Can you see the problem?


Solution

  • This is Cloudflare protection. I don't know why it is applied to some paths and not others, but it is passive protection that uses TLS/JA3/HTTP2 fingerprinting to block bots and scrapers.

    Fortunately, in this scenario it can be bypassed by impersonating a browser's fingerprints with curl_cffi, which has a requests-like API.

    Since this site uses an API, we can retrieve the data directly in JSON format instead of extracting it from the HTML.

    The code below retrieves the results for this page: https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1

    from curl_cffi import requests
    
    def get_pid():
        url = 'https://www.falabella.com/s/geo/v2/districts/cl?politicalId=default'
        response = requests.get(url)
        data = response.json().get('data', {})
        return data.get('politicalId')
    
    
    api_url = "https://www.falabella.com/s/browse/v1/listing/cl"
    
    # The pid does not seem to change or expire, so you can replace this call with a hard-coded string
    pid = get_pid()
    
    params = {
        'f.derived.variant.sellerId': 'FALABELLA',
        'facetSelected': True,
        'page': 1,
        'categoryId': 'cat7330051',
        'categoryName': 'Mujer',
        'pid': pid,
    }
    
    response = requests.get(api_url, params=params, impersonate='chrome')
    data = response.json()['data']
    
    pagination = data['pagination']
    results = data['results']
    
    print(f'{len(results) = }')
    
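The params dict above was filled in by hand from the page URL. As a rough sketch, the same mapping can be derived from any category URL, assuming the path always has the form /falabella-cl/category/<categoryId>/<categoryName> as in the two URLs from the question. The helper name listing_params_from_url and its pid and page arguments are hypothetical, not part of the site or of curl_cffi, and query values are kept as the strings found in the URL.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def listing_params_from_url(page_url, pid=None, page=None):
    """Build listing API params from a category page URL (hypothetical helper).

    Assumes a path like /falabella-cl/category/<categoryId>/<categoryName>.
    """
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split('/') if s]
    try:
        idx = segments.index('category')
        category_id = segments[idx + 1]
        category_name = unquote(segments[idx + 2])
    except (ValueError, IndexError):
        raise ValueError(f'not a category URL: {page_url}') from None

    # parse_qsl percent-decodes values, so %3A%3A becomes '::'
    params = dict(parse_qsl(parts.query))
    params['categoryId'] = category_id
    params['categoryName'] = category_name
    if page is not None:
        params['page'] = str(page)  # override the page number from the URL
    if pid is not None:
        params['pid'] = pid
    return params


if __name__ == '__main__':
    url = ('https://www.falabella.com/falabella-cl/category/cat7330051/Mujer'
           '?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1')
    print(listing_params_from_url(url, page=2))
```

The returned dict can be passed as params to the requests.get call above, and changing the page argument is one way to walk through further result pages.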

    Don't forget to install curl_cffi using pip:

    pip install curl_cffi --upgrade
    

    Note: I have removed two params (pgid and zones) that did not seem to do anything. If you notice any discrepancy between these results and the ones in the HTML (__NEXT_DATA__), you could try adding them back (copy them from devtools).
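
To check for such a discrepancy, the __NEXT_DATA__ JSON from the HTML page is needed. Below is a minimal sketch that pulls it out with the standard library's html.parser instead of BeautifulSoup (a swap made only to keep the example dependency-free). The names extract_next_data and fetch_next_data are hypothetical, and it is an assumption that impersonate='chrome' also gets the HTML page past the 403. The layout of the JSON inside __NEXT_DATA__ is not shown here, so locating the matching results in it is left to inspection.

```python
import json
from html.parser import HTMLParser


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == '__NEXT_DATA__':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is missing."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.chunks).strip()
    return json.loads(text) if text else None


def fetch_next_data(page_url):
    """Download a page with a browser fingerprint and extract its JSON."""
    # Imported here so extract_next_data works without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(page_url, impersonate='chrome')
    response.raise_for_status()
    return extract_next_data(response.text)
```

Calling fetch_next_data with the category page URL and saving the result next to the API response makes it possible to compare the two side by side.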