I would like to know how to update session cookies before scraping. The code works if I open Firefox, copy the cookies by hand, and insert them into the code. But every time I try doing it in code I get a 403 response. What am I doing wrong?
import requests
import pandas as pd
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
#Function to grab new cookies
def update_cookies(session):
    """Open the target page in a real Chrome browser and copy the
    Cloudflare/session cookies into the given requests session.

    Parameters
    ----------
    session : requests.Session
        Session whose cookie jar is updated in place.
    """
    driver_path = 'C:/Program Files/chromedriver.exe'
    opts = Options()
    # BUG FIX: the argument must be prefixed with "user-agent=" —
    # without it Chrome silently ignores the flag and keeps its default
    # UA.  Cloudflare binds cf_clearance to the User-Agent, so the
    # cookie must be minted under the SAME UA string that the requests
    # session later sends, otherwise the 403 comes back.
    opts.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
        "Gecko/20100101 Firefox/116.0"
    )
    # NOTE(review): executable_path/chrome_options are removed in
    # Selenium 4 — use Service(driver_path) and options= there.
    driver = webdriver.Chrome(executable_path=driver_path, chrome_options=opts)
    driver.get('https://www.reuter.de/mariner-duschsystem-mit-eden-edge-thermostat-inkl-grundkoerper-kopfbrause-slim-300-mm-und-metall-brauseset-schwarz-matt-a1161798.php')
    # Give Cloudflare's JS challenge time to complete and set cookies.
    time.sleep(10)
    cookies_new = driver.get_cookies()
    driver.quit()
    # Only these three cookies matter for the API session.
    wanted = ('__cf_bm', 'cf_clearance', 'XTCsid')
    for c in cookies_new:
        print(str(c))
        if c['name'] in wanted:
            print('Found ' + c['name'])
            session.cookies[c['name']] = c['value']
#Start a session
session = requests.Session()
#Old cookies to be updated (not necessary when update_cookies() is working)
cookies = {
    "user_locale_selection": "https://www.reuter.de",
    "cookie_test": "please_accept_for_session",
    "__cf_bm": "Oayhkm3Ps7J.onL1Km1NO3S2xXMlxDjPEbL3H4myz7Y-1691765714-0-ARw0RHJXQNMM7RN3aw1OTIbQNpRk4AQIY63pQdj3IRDeYzpAQm4gGmNKgsM2qVDw4X7i13zSLIvGV9eir0cw54BNxlWuVRKdkxWyaSEfH4fs",
    "cf_clearance": "2XeYVrW.r5zZAddHaHJTBglFsWJx8wRe140z1kLZJOw-1691765716-0-1-ee7419d2.733d08ea.a668f614-0.2.1691765716",
    "feedback": "1",
    "XTCsid": "s:a118e3fbb3a07fdd44c51ec9e44b4ad2.JMFQIZZCu/8CZRQgPoSYzdqva80TX2qycnxngYQ5BxY"
}
#Assign old cookies to session
session.cookies.update(cookies)
#Define headers — the User-Agent MUST match the one the cookies were
#minted under, because Cloudflare validates cf_clearance against it.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0",
    "Accept": "application/json",
    "Accept-Language": "de,en-US;q=0.7,en;q=0.3",
    # BUG FIX: "utf-8" is a charset, not a content-coding; browsers send
    # compression codings here and requests decompresses transparently.
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
#Set request params and url
base_url = "https://www.reuter.de/services/products/"
params = {
    "$language": "de",
    "$select": "id,name,price,category,manufacturer,series,url"
}
#Empty List
all_ids = []
# BUG FIX: counter was never initialized — the NameError at
# print(counter) was swallowed by the broad except and broke the loop
# after the first page.
counter = 0
#Update cookies
update_cookies(session)
#Start scraping: the API pages with $skip/limit/total.
while True:
    try:
        response = session.get(base_url, params=params, headers=headers)
        r = response.json()
        #Add this page's records to the list
        all_ids.extend(r['data'])
        #End condition: everything fetched
        if r["limit"] + r["skip"] >= r["total"]:
            break
        #Set next request offset
        params["$skip"] = r["limit"] + r["skip"]
        print(counter)
        print(r)
        counter = counter + 1
    except Exception as e:
        # Best-effort loop: log and stop on any failure (403, bad JSON, ...).
        print('Error:', e)
        print(response.status_code)
        break
#Info to df -> csv
df = pd.json_normalize(all_ids)
df.to_csv('all_perfect.csv', sep=';', encoding='utf-8', decimal=',')
As mentioned, the code works when I update the cookies manually, but it doesn't when I use Selenium to do so.
So, you are setting irrelevant data in the cookies and headers, which leads to an incorrect result.
I'm not sure what the correct combination for this site is that leads to a correct response via session.get(url).
However, I can suggest a workaround — not the best one, but a working one — for getting the JSON
you need without cookies/headers, using Selenium.
To be clear: it is not the ideal solution.
Reuter provides filtered JSON when you access the endpoint directly and pass the params in the URL. This is yours:
https://www.reuter.de/services/products?$select=id,name,price,category,manufacturer,series,url&$language=de
By accessing it you would receive data in JSON
format without 403 (as far as you do it through real browser).
You just need to get JSON element text and parse it.
import json

# BUG FIX: EC and By were used below without being imported.
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# NOTE: assumes `driver` is an already-created webdriver instance
# (e.g. the one from update_cookies before driver.quit()).
driver.get(
    'https://www.reuter.de/services/products?$select=id,name,price,category,manufacturer,series,url&$language=de')
# Chrome renders a raw JSON response inside a <pre> element; wait for it.
wait = WebDriverWait(driver, 10)
response = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'pre'))).text
json_object = json.loads(response)  # this is your r variable
Normally this would be done with a plain HTTP request, without Selenium, but since you don't know the proper access data, this solution can serve as a workaround.