python web-scraping playwright playwright-python

Playwright - scraping eBay deals

from playwright.sync_api import Playwright, sync_playwright

with sync_playwright() as playwright:
    chromium = playwright.chromium
    browser = chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.ebay.com/deals/tech/ipads-tablets-ereaders")
    button = page.locator("button.load-more-btn.btn.btn--secondary")
    try:
        while button:
            button.scroll_into_view_if_needed()
            button.click()
    except:
        pass
        items = page.locator("div.dne-itemtile.dne-itemtile-large").all()
        for item in items:
            print(item.locator("img").get_attribute("src"))
            print(item.locator("span.first").text_content())
            print(item.locator("span.ebayui-ellipsis-2").text_content())
            print()
        print(len(items), "items")

I am trying to scrape eBay deals.
In my try block, with headless=False, I can see the browser clicking the "load more" button until there is no button left. However, the code does not scrape all the items, only about the first 4 pages at most.

An eBay deals page can list more than 800 items, but I can only scrape the first 96.

Solution

In short, when you click the button (or scroll down), the browser sends a request to the server to retrieve more deals; you can see it in the browser's developer tools. You can fetch the deals with requests alone, without Playwright or Selenium.

Example:

import time
import json
import requests
from bs4 import BeautifulSoup

LISTINGS_URL = "https://www.ebay.com/deals/spoke/ajax/listings"
TIMEZONE_OFFSET = 63072000

def get_dp1():
    current_time = hex(int(time.time()) + TIMEZONE_OFFSET)[2:]
    return f"bbl/DE{current_time}^"

def parse_deals(content):
    soup = BeautifulSoup(content, "lxml")
    items = []
    for el in soup.select("div[data-listing-id]"):
        image = el.select_one("img").get("src")
        price = el.select_one("span.first").text
        title = el.select_one("span.ebayui-ellipsis-2").text
        items.append({"title": title, "price": price, "image": image})
    return items

items = []

with requests.Session() as session:
    # The listings endpoint requires the dp1 cookie.
    session.cookies.set("dp1", get_dp1())
    params = {"_ofs": 0, "category_path_seo": "tech,ipads-tablets-ereaders"}
    while True:
        print(f"Total: {len(items):<5} | Offset: {params['_ofs']}")
        response = session.get(LISTINGS_URL, params=params)
        response.raise_for_status()
        data = response.json().get("fulfillmentValue", {})
        # Parse the current page before checking for a next page,
        # so the last page is not skipped.
        items.extend(parse_deals(data.get("listingsHtml", "")))
        # The response carries the query parameters for the next page.
        params = data.get("pagination", {}).get("params")
        if not params:
            break

with open("data.json", "w") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

Output:

[
  {
    "title": "Samsung Galaxy Tab A9+ 11.0\" 64GB Gray Wi-Fi Tablet Bundle SM-X210NZAYXAR 2023",
    "price": "$139.99",
    "image": "https://i.ebayimg.com/images/g/qbUAAOSw1o1l1Rtt/s-l300.jpg"
  },
  ...
]

As mentioned earlier, the browser fetches the deals with a GET request that must include the dp1 cookie, which is derived from the current Unix time (for example, bbl/DE6a9839a1^). Here, bbl/DE and ^ are constant values (as I understand it), and between them is the Unix time in hexadecimal format.

You may need to adjust the offset added to the Unix time (TIMEZONE_OFFSET in the code, set to 63072000 seconds, which is 730 days), because when you access the site, it sets the value of the dp1 cookie relative to its own timezone.
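
The cookie format described above can be checked offline. This is a minimal sketch, assuming the value is exactly bbl/DE, then a hexadecimal timestamp, then ^, as observed in the example cookie. The names build_dp1 and decode_dp1 are hypothetical helpers for illustration, not part of any eBay API.

```python
import time

DP1_PREFIX = "bbl/DE"      # assumed constant, taken from the example cookie
DP1_SUFFIX = "^"           # assumed constant, taken from the example cookie
OFFSET_SECONDS = 63072000  # offset used in the answer: 730 days in seconds


def build_dp1(now=None, offset=OFFSET_SECONDS):
    """Return a dp1 cookie value for the given Unix time (default: now)."""
    if now is None:
        now = int(time.time())
    return f"{DP1_PREFIX}{now + offset:x}{DP1_SUFFIX}"


def decode_dp1(value):
    """Return the Unix timestamp embedded in a dp1 cookie value."""
    if not (value.startswith(DP1_PREFIX) and value.endswith(DP1_SUFFIX)):
        raise ValueError(f"unexpected dp1 format: {value!r}")
    return int(value[len(DP1_PREFIX):-len(DP1_SUFFIX)], 16)


if __name__ == "__main__":
    sample = "bbl/DE6a9839a1^"
    embedded = decode_dp1(sample)
    # Timestamp stored in the cookie, and the same value without the offset.
    print(embedded, embedded - OFFSET_SECONDS)
    # Rebuilding from the un-offset time reproduces the sample value.
    print(build_dp1(now=embedded - OFFSET_SECONDS) == sample)
```

Decoding the dp1 value that your own browser receives and comparing it with the current Unix time is one way to find the offset to use.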

After that, the server responds with a JSON object that contains all the necessary information for scraping.
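
The way the JSON response drives pagination can be sketched without touching the network. The key names (fulfillmentValue, listingsHtml, pagination, params) are the ones the code above reads. The iter_listing_pages helper, the fetch callable, and the sample pages are made up for illustration and may not match the real payload in every detail.

```python
def iter_listing_pages(fetch, first_params):
    """Yield the listings HTML of each page until no next-page params remain.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response as a dict (hypothetical interface, for illustration).
    """
    params = dict(first_params)
    while params:
        data = fetch(params).get("fulfillmentValue", {})
        html = data.get("listingsHtml")
        if html:
            yield html
        # The response carries the query parameters for the next page, if any.
        params = data.get("pagination", {}).get("params")


# Made-up responses keyed by offset, shaped like the fields the answer reads.
FAKE_PAGES = {
    0: {"fulfillmentValue": {"listingsHtml": "<div>page 1</div>",
                             "pagination": {"params": {"_ofs": 24}}}},
    24: {"fulfillmentValue": {"listingsHtml": "<div>page 2</div>",
                              "pagination": {}}},
}


def fake_fetch(params):
    """Stand-in for the HTTP call: look the page up by its offset."""
    return FAKE_PAGES[params["_ofs"]]


if __name__ == "__main__":
    for html in iter_listing_pages(fake_fetch, {"_ofs": 0}):
        print(html)
```

With a real session, fetch could be something like lambda params: session.get(LISTINGS_URL, params=params).json(), and each yielded HTML fragment would be passed to parse_deals.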