Tags: web-scraping, beautifulsoup, python-requests, scrapy, urllib3

How can I scrape location-dependent data from Amazon?


Whenever I try to scrape amazon.com, I fail, because the product information on amazon.com changes according to the delivery location.

The information that changes is:

  • Price
  • Shipping fee
  • Customs fee
  • Shipping status

Changing the location with Selenium is simple, but it is very slow, so I need to scrape with Scrapy or requests.

However, even though I replicate the browser's cookies and headers, amazon.com does not let me change the location.

There are two big problems:

  1. There is a value called "ubid-main" that I cannot reproduce. Without it, Amazon does not allow the location to be changed.
  2. Even though I send the same headers, the outgoing data differs. For example, I use exactly the same headers as the browser, but in the browser the Content-Type is JSON, whereas in my code it is text/html; charset=UTF-8.

Surprisingly, I could not find any information on this subject, as if location-based scraping were impossible on the world's number one shopping site.

If anyone knows the answer, please enlighten me. A solution using either Scrapy or requests is sufficient. Seriously, I have not been able to solve this for a year.

import requests
import urllib3
from urllib3.exceptions import InsecureRequestWarning

# Suppress the warning raised because the requests below use verify=False.
urllib3.disable_warnings(InsecureRequestWarning)

    

def location():
    headersdelivery = {
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
            'content-type':'application/x-www-form-urlencoded',
            'accept':'text/html,*/*',
            'x-requested-with':'XMLHttpRequest',
            'contenttype':'application/x-www-form-urlencoded;charset=utf-8',
            'origin':'https://www.amazon.com',
            'sec-fetch-site':'same-origin',
            'sec-fetch-mode':'cors',
            'sec-fetch-dest':'empty',
            'referer':'https://www.amazon.com/',
            'accept-encoding':'gzip, deflate, br',
            'accept-language':'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7'
            }

    payload = {
    'locationType':'LOCATION_INPUT',
    'zipCode':'34249',
    'storeContext':'generic',
    'deviceType':'web',
    'pageType':'Gateway',
    'actionSource':'glow',
    'almBrandId':'undefined'}


    sessionid = requests.session()
    url = "https://www.amazon.com/gp/delivery/ajax/address-change.html"
    ulkesecmereq = sessionid.post(url, headers=headersdelivery, data=payload,verify=False)

    return sessionid


def response(locationsession):
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'}

    postdata = {
    'storeContext':'generic',
    'pageType':'Gateway'
    }
    req = locationsession.post("https://www.amazon.com/gp/glow/get-location-label.html",headers=headers, data=postdata, verify=False)
    print(req.content)


locationsession = location()
response(locationsession)

Solution

  • First, get the anti-csrftoken-a2z token from the Amazon home page:

    1. Make a GET request to www.amazon.com with a browser-like User-Agent: Mozilla ...

    2. Extract the JSON data with this XPath selector:

    //span[@id='nav-global-location-data-modal-action']/@data-a-modal

    Sample JSON returned by this selector:

    {
      "width": 375,
      "closeButton": "false",
      "popoverLabel": "Choose your location",
      "ajaxHeaders": {
        "anti-csrftoken-a2z": "ajaxHeaders >> anti-csrftoken-a2z"
      },
      "name": "glow-modal",
      "url": "/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal",
      "footer": "<span class=\"a-declarative\" data-action=\"a-popover-close\" data-a-popover-close=\"{}\"><span class=\"a-button a-button-primary\"><span class=\"a-button-inner\"><button name=\"glowDoneButton\" class=\"a-button-text\" type=\"button\">Done</button></span></span></span>",
      "header": "Choose your location"
    }
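
Pulling the token out of that attribute can be sketched as follows. This is a minimal illustration that assumes the attribute value has already been HTML-decoded by the parser (as parsel does); the sample string is a trimmed copy of the JSON above with a placeholder token, and the function name is hypothetical.

```python
import json

# Trimmed copy of the sample JSON above; the token value is a placeholder.
SAMPLE_DATA_A_MODAL = """
{
  "width": 375,
  "ajaxHeaders": {"anti-csrftoken-a2z": "PLACEHOLDER_TOKEN"},
  "name": "glow-modal"
}
"""


def extract_ajax_token(data_a_modal: str) -> str:
    """Return the anti-csrftoken-a2z value stored in a data-a-modal attribute."""
    data = json.loads(data_a_modal)
    try:
        return data["ajaxHeaders"]["anti-csrftoken-a2z"]
    except (KeyError, TypeError) as exc:
        # The page layout changed or a captcha page was returned instead.
        raise ValueError("anti-csrftoken-a2z not found in data-a-modal") from exc


print(extract_ajax_token(SAMPLE_DATA_A_MODAL))
```

Raising an error when the key is missing makes a changed page layout or a blocked request visible immediately, rather than failing later with a rejected token.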
    
    3. Build the headers for the next request, using the token from the JSON:

    headers = {
        "anti-csrftoken-a2z": "gMDCYRgjYFVWvjfmU70/qMURqYh7kAko11WlenYAAAAMAAAAAGGokFZyYXcAAAAA",
        "user-agent": "Mozilla ...",
    }
    
    4. Make a GET request to https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal, sending the headers built in the previous step and the response cookies from the first request.

    5. Extract CSRF_TOKEN from the response with the regex: 'CSRF_TOKEN : "(.+?)"'

    6. Build the headers for the next request:

    headers = {
        "anti-csrftoken-a2z": "CSRF token extracted in the previous step",
        "user-agent": "Mozilla ..."
    }
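
The extraction and header-building steps can be sketched together like this. The sample response text is a hypothetical fragment (the real markup may differ), and both function names are illustrative; only the regex and the two header names come from the steps above.

```python
import re

# Regex from the step above; the token sits inside an inline script.
CSRF_TOKEN_PATTERN = re.compile(r'CSRF_TOKEN : "(.+?)"')

# Hypothetical fragment of the get-address-selections response.
SAMPLE_RESPONSE_TEXT = '<script>... CSRF_TOKEN : "PLACEHOLDER_CSRF", ...</script>'


def extract_csrf_token(page_text: str) -> str:
    """Return the first CSRF_TOKEN value found in the response text."""
    match = CSRF_TOKEN_PATTERN.search(page_text)
    if match is None:
        raise ValueError("CSRF token not found")
    return match.group(1)


def build_address_change_headers(csrf_token: str, user_agent: str) -> dict:
    """Build the headers for the address-change POST request."""
    return {"anti-csrftoken-a2z": csrf_token, "user-agent": user_agent}


token = extract_csrf_token(SAMPLE_RESPONSE_TEXT)
print(build_address_change_headers(token, "Mozilla ..."))
```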
    
    7. Make a POST request to https://www.amazon.com/gp/delivery/ajax/address-change.html with this form data:
    {
            "locationType": "LOCATION_INPUT",
            "zipCode": "zip-code",
            "storeContext": "generic",
            "deviceType": "web",
            "pageType": "Gateway",
            "actionSource": "glow",
            "almBrandId": "undefined",
    }
    

    Send it with the headers built in the previous step and the response cookies from the get-address-selections request.
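
To see what actually goes over the wire for this form, here is a small sketch using only the standard library. When a dict is passed to requests via `data=`, it is form-encoded in the same way and the request Content-Type is set to application/x-www-form-urlencoded. The function name is hypothetical; the field names and values are the ones listed above.

```python
from urllib.parse import urlencode


def encode_address_change_form(zip_code: str) -> str:
    """Return the form-encoded body for the address-change POST request."""
    form = {
        "locationType": "LOCATION_INPUT",
        "zipCode": zip_code,
        "storeContext": "generic",
        "deviceType": "web",
        "pageType": "Gateway",
        "actionSource": "glow",
        "almBrandId": "undefined",
    }
    # urlencode keeps the dict's insertion order.
    return urlencode(form)


print(encode_address_change_form("30322"))
```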

    If everything works, you should get a response like this:

    {
        'isValidAddress': 1,
        'isTransitOutOfAis': 0,
        'address': {
            'locationType': 'LOCATION_INPUT',
            'district': None,
            'zipCode': '30322',
            'addressId': None,
            'isDefaultShippingAddress': 'false',
            'obfuscatedId': None,
            'isAccountAddress': 'false',
            'state': 'GA',
            'countryCode': 'US',
            'addressLabel': None,
            'city': 'ATLANTA',
            'addressLine1': None,
        },
        'sembuUpdated': 1,
    }
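
A response of this shape can be checked before trusting the new cookies. The sketch below assumes the payload has already been parsed into a dict (for example with `response.json()`); the function name and the decision to compare the echoed ZIP code are illustrative choices, not part of Amazon's API.

```python
def location_change_succeeded(payload: dict, zip_code: str) -> bool:
    """True when the address is marked valid and the requested ZIP code is echoed back."""
    address = payload.get("address") or {}
    return bool(payload.get("isValidAddress")) and address.get("zipCode") == zip_code


# Trimmed copy of the sample response above.
SAMPLE_RESPONSE = {
    "isValidAddress": 1,
    "address": {"zipCode": "30322", "state": "GA", "city": "ATLANTA"},
}

print(location_change_succeeded(SAMPLE_RESPONSE, "30322"))
```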
    
    
    8. Save the response cookies from the address-change request and reuse them for all further requests.

    Python script with all logic:

    import json
    
    import requests
    from parsel import Selector
    
    AMAZON_US_URL = "https://www.amazon.com/"
    AMAZON_ADDRESS_CHANGE_URL = (
        "https://www.amazon.com/gp/delivery/ajax/address-change.html"
    )
    AMAZON_CSRF_TOKEN_URL = (
        "https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop"
        "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
    )
    DEFAULT_USER_AGENT = (
        "Mozilla/5.0 ..."
    )
    DEFAULT_REQUEST_HEADERS = {"Accept-Language": "en", "User-Agent": DEFAULT_USER_AGENT}
    
    
    def get_amazon_content(start_url: str, cookies: dict = None) -> tuple:
        response = requests.get(
            url=start_url, headers=DEFAULT_REQUEST_HEADERS, cookies=cookies
        )
        response.raise_for_status()
        return Selector(text=response.text), response.cookies
    
    
    def get_ajax_token(content: Selector):
        data = content.xpath(
            "//span[@id='nav-global-location-data-modal-action']/@data-a-modal"
        ).get()
        if not data:
            raise ValueError("Invalid page content")
        json_data = json.loads(data)
        return json_data["ajaxHeaders"]["anti-csrftoken-a2z"]
    
    
    def get_session_id(content: Selector):
        session_id = content.re_first(r'session: \{id: "(.+?)"')
        if not session_id:
            raise ValueError("Session id not found")
        return session_id
    
    
    def get_token(content: Selector):
        csrf_token = content.re_first(r'CSRF_TOKEN : "(.+?)"')
        if not csrf_token:
            raise ValueError("CSRF token not found")
        return csrf_token
    
    
    def send_change_location_request(zip_code: str, headers: dict, cookies: dict):
        response = requests.post(
            url=AMAZON_ADDRESS_CHANGE_URL,
            data={
                "locationType": "LOCATION_INPUT",
                "zipCode": zip_code,
                "storeContext": "generic",
                "deviceType": "web",
                "pageType": "Gateway",
                "actionSource": "glow",
                "almBrandId": "undefined",
            },
            headers=headers,
            cookies=cookies,
        )
        assert response.json()["isValidAddress"], "Invalid change response"
        return response.cookies
    
    
    def get_session_cookies(zip_code: str):
        response = requests.get(url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS)
        content = Selector(text=response.text)
    
        headers = {
            "anti-csrftoken-a2z": get_ajax_token(content=content),
            "user-agent": DEFAULT_USER_AGENT,
        }
        response = requests.get(
            url=AMAZON_CSRF_TOKEN_URL, headers=headers, cookies=response.cookies
        )
        content = Selector(text=response.text)
    
        headers = {
            "anti-csrftoken-a2z": get_token(content=content),
            "user-agent": DEFAULT_USER_AGENT,
        }
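        location_cookies = send_change_location_request(
            zip_code=zip_code, headers=headers, cookies=dict(response.cookies)
        )
        # Combine the session cookies with any cookies set by the address change.
        session_cookies = {**dict(response.cookies), **dict(location_cookies)}
        # Verify that the location changed correctly.
        response = requests.get(
            url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS, cookies=session_cookies
        )
        content = Selector(text=response.text)
        location_label = (
            content.css("span#glow-ingress-line2::text").get(default="").strip()
        )
    
        assert zip_code in location_label, "Location label does not contain the ZIP code"
        # Return the cookies so that callers can reuse them for further requests.
        return session_cookies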
    
    
    if __name__ == "__main__":
        get_session_cookies(zip_code="30322")
    

    The same logic using the Scrapy framework:

    from http.cookies import SimpleCookie
    
    from scrapy import FormRequest, Request, Spider
    from scrapy.http import HtmlResponse
    
    
    class AmazonSessionSpider(Spider):
        """
        Amazon spider for extracting location cookies.
        """
    
        name = "amazon.com:location-session"
    
        address_change_endpoint = "/gp/delivery/ajax/address-change.html"
        csrf_token_endpoint = (
            "/gp/glow/get-address-selections.html?deviceType=desktop"
            "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
        )
        countries_base_urls = {
            "US": "https://www.amazon.com",
            "GB": "https://www.amazon.co.uk",
            "DE": "https://www.amazon.de",
            "ES": "https://www.amazon.es",
        }
    
        default_headers = {
            "sec-fetch-site": "none",
            "sec-fetch-dest": "document",
            "accept-language": "ru-RU,ru;q=0.9",
            "connection": "close",
        }
    
        def __init__(self, country: str, zip_code: str, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.country = country
            self.zip_code = zip_code
    
        def start_requests(self):
            """
            Make start request to main Amazon country page.
            """
            request = Request(
                url=self.countries_base_urls[self.country],
                headers=self.default_headers,
                callback=self.parse_ajax_token,
            )
            yield request
    
        def parse_ajax_token(self, response: HtmlResponse):
            """
            Parse ajax token from response.
            """
            yield response.request.replace(
                url=self.countries_base_urls[self.country] + self.csrf_token_endpoint,
                headers={
                    "anti-csrftoken-a2z": self.get_ajax_token(response=response),
                    **self.default_headers,
                },
                callback=self.parse_csrf_token,
            )
    
        def parse_csrf_token(self, response: HtmlResponse):
            """
            Parse CSRF token from response and make request to change Amazon location.
            """
            yield FormRequest(
                method="POST",
                url=self.countries_base_urls[self.country] + self.address_change_endpoint,
                formdata={
                    "locationType": "LOCATION_INPUT",
                    "zipCode": self.zip_code,
                    "storeContext": "generic",
                    "deviceType": "web",
                    "pageType": "Gateway",
                    "actionSource": "glow",
                    "almBrandId": "undefined",
                },
                headers={
                    "anti-csrftoken-a2z": self.get_csrf_token(response=response),
                    **self.default_headers,
                },
                callback=self.parse_session_cookies,
            )
    
        def parse_session_cookies(self, response: HtmlResponse) -> dict:
            """
            Return cookies dict if location changed successfully.
            """
            json_data = response.json()
            if not json_data.get("isValidAddress"):
                return {}
            return self.extract_response_cookies(response=response)
    
        @staticmethod
        def get_ajax_token(response: HtmlResponse) -> str:
            """
            Extract ajax token from response.
            """
            data = response.xpath("//input[@id='glowValidationToken']/@value").get()
            if not data:
                raise ValueError("Invalid page content")
            return data
    
        @staticmethod
        def get_csrf_token(response: HtmlResponse) -> str:
            """
            Extract CSRF token from response.
            """
            csrf_token = response.css("script").re_first(r'CSRF_TOKEN : "(.+?)"')
            if not csrf_token:
                raise ValueError("CSRF token not found")
            return csrf_token
    
        @staticmethod
        def extract_response_cookies(response: HtmlResponse) -> dict:
            """
            Extract cookies from response object
            and return it in valid format.
            """
            cookies = {}
            cookie_headers = response.headers.getlist("Set-Cookie", [])
            for cookie_str in cookie_headers:
                cookie = SimpleCookie()
                cookie.load(cookie_str.decode("utf-8"))
                for key, raw_value in cookie.items():
                    cookies[key] = raw_value.value
            return cookies
    

    Shell command:

     scrapy crawl amazon.com:location-session -a country=US -a zip_code=30332
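
To reuse the spider's result elsewhere, one option is to export it with Scrapy's feed option (for example by adding `-o cookies.json` to the command; the file name is hypothetical) and read the cookies back. The sketch below assumes the feed is a JSON array of cookie dicts, as produced by `parse_session_cookies`, and both function names are illustrative.

```python
import json


def parse_cookie_feed(feed_text: str) -> dict:
    """Return the first non-empty cookie dict from a Scrapy JSON feed."""
    for item in json.loads(feed_text):
        if item:
            return item
    # The spider yields an empty dict when the location change fails.
    raise ValueError("No location cookies found in feed")


def to_cookie_header(cookies: dict) -> str:
    """Format a cookie dict as the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical feed content with placeholder cookie values.
sample_feed = '[{"session-id": "PLACEHOLDER", "ubid-main": "PLACEHOLDER"}]'
print(to_cookie_header(parse_cookie_feed(sample_feed)))
```

With a real feed file, the text could be read with `pathlib.Path("cookies.json").read_text()` and the resulting dict passed as `cookies=` to requests or to a Scrapy `Request`.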