Tags: python, web-scraping, web, python-requests

Why does web scraping a website using Python requests connect to a US server instead of a Greek one and return non-Greek content?


I LIVE IN GREECE / I HAVE A GREEK IP

I'm trying to web scrape a website using Python and the requests library, but I've noticed that the requests connect to a US server instead of a Greek one. Additionally, the content I get is not in Greek.

I've set the headers and user-agent to mimic a Greek user, but it doesn't seem to have any effect. Here's the Python script I'm using:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
import firebase_admin
from firebase_admin import credentials, db
import asyncio

# Initialize Firebase with your credentials JSON file and database URL
cred = credentials.Certificate("C:\\Users\\alexl\\OneDrive\\Desktop\\Cook Group\\Scripts\\SNS\\creds.json")
firebase_admin.initialize_app(cred, {
    'databaseURL': 'https://sns-database-default-rtdb.europe-west1.firebasedatabase.app/'
})

# Function to load data from Firebase
# (snipped for brevity)

# Function to save data to Firebase
# (snipped for brevity)

# Your bot should be running inside an async function
async def main():
    while True:
        print("Refreshing data...")  # Debug message

        # Specify the URL you want to scrape
        url = "https://www.sneakersnstuff.com/en/176/nike-dunk"

        headers = {
            'accept-language': 'en-GR-0-0',
            "sns.state": "en-GR-0-0",
            "Cookie": "sns.state=en-GR-0-0",       
            'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
        }
        # Load the previously scraped data from Firebase
        # (snipped for brevity)

        # Send an HTTP GET request to the URL
        response = requests.get(url, headers=headers)

        # Check if the request was successful (status code 200)
        # (snipped for brevity)

        print("Waiting for the next refresh...")  # Debug message

        # Wait for 60 seconds before the next refresh
        await asyncio.sleep(60)

# Ensure that the main function is run
if __name__ == "__main__":
    asyncio.run(main())

Here are the website's cookies:

[screenshot of the site's cookies]
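
The cookies the server sends back can also be listed with a quick requests call (the exact names and values depend on the response):

import requests

check = requests.get(
    "https://www.sneakersnstuff.com/en/176/nike-dunk",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"},
)
# Print every cookie name/value pair the server set
for name, value in check.cookies.items():
    print(name, "=", value)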

What should I do?


Solution

  • Set the PreferredRegion cookie:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.sneakersnstuff.com/en/176/nike-dunk"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0"
    }
    
    # PreferredRegion selects the shop's region; 2358 turns out to be Greece (see the output below)
    cookies = {"PreferredRegion": "2358"}
    
    soup = BeautifulSoup(
        requests.get(url, headers=headers, cookies=cookies).content, "html.parser"
    )
    
    # The link next to the "Region" heading shows the currently selected region
    region = soup.select_one('h3:-soup-contains("Region") + a').text
    print(region)
    

    Prints:

    GR:en
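
    If it helps, here is a minimal sketch of how the same cookie could be dropped into the original refresh loop (the PreferredRegion value 2358 is the one that produced GR:en above; requests is blocking, so the call is pushed to a worker thread to keep the event loop responsive):

    import asyncio
    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.sneakersnstuff.com/en/176/nike-dunk"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0"
    }
    # Region cookie observed above; 2358 produced GR:en
    COOKIES = {"PreferredRegion": "2358"}

    async def main():
        while True:
            print("Refreshing data...")
            # requests is synchronous, so run it in a worker thread (Python 3.9+)
            response = await asyncio.to_thread(
                requests.get, URL, headers=HEADERS, cookies=COOKIES
            )
            soup = BeautifulSoup(response.content, "html.parser")
            # ... parse the product listing and push to Firebase here ...
            print("Waiting for the next refresh...")
            await asyncio.sleep(60)

    if __name__ == "__main__":
        asyncio.run(main())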