python beautifulsoup urllib

How to get more items from a dynamically rendered webpage when web scraping


I'm using Python to scrape restaurant names from Foodpanda. The page's items are all rendered through a <script>, so I can't get any data from the page's HTML or CSS.

import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

foodpanda_url = "https://www.foodpanda.hk/restaurants/new?lat=22.33523782&lng=114.18249102&expedition=pickup&vertical=restaurants"

# send a request to the page, using the Mozilla 5.0 browser header
req = Request(foodpanda_url, headers={'User-Agent' : 'Mozilla/5.0'})
# open the page using our urlopen library
page = urlopen(req)

soup = BeautifulSoup(page.read(), "html.parser")
print(soup.prettify())

str_soup = str(soup.prettify())

I parse out the vendors string from str_soup using the following:

fp_vendors = list()
vendorlst = str_soup.split("\"discoMeta\":{\"reco_config\":{\"flags\":[]},\"traces\":[]},\"items\":")
opensqr = 0
startobj = 0

for i in range(1, len(vendorlst)):  # skip the text before the first marker
    for cnt in range(len(vendorlst[i])):
        if vendorlst[i][cnt] == '[':
            opensqr += 1
        elif vendorlst[i][cnt] == ']':
            opensqr -= 1
        if opensqr == 0:
            # matching ']' found: the text between the brackets is the vendor array
            vendorsStr = vendorlst[i][1:cnt]
            opencurly = 0
            startobj = 0  # reset for each array
            for x in range(len(vendorsStr)):
                if vendorsStr[x] == ',':
                    continue
                if vendorsStr[x] == '{':
                    opencurly += 1
                elif vendorsStr[x] == '}':
                    opencurly -= 1
                if opencurly == 0:
                    vendor = vendorsStr[startobj:x+1]
                    if (vendor not in fp_vendors) and vendor != "":
                        fp_vendors.append(vendor)
                    startobj = x+2  # continue to next {
            break

for item in fp_vendors:
#     print(item+"\n")
    itemstr = re.split("\"minimum_pickup_time\":[0-9]+,\"name\":\"", item)[1]
    itemstr = itemstr.split("\",")[0]
    print(itemstr+"\n")
print(len(fp_vendors))

However, this only returns about 50 restaurants. How can I get more restaurant items from Foodpanda? How do I simulate scrolling down the page so that more items are loaded and I can collect them?


Solution

  • Using your browser's dev tools, you can easily monitor all the requests a page makes. For your particular case, I found this API call:

    https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?latitude=22.33523782&longitude=114.18249102&language_id=1&include=characteristics&dynamic_pricing=0&configuration=Variant1&country=hk&customer_id=&customer_hash=&budgets=&cuisine=&sort=&food_characteristic=&use_free_delivery_label=false&opening_type=pickup&vertical=restaurants&limit=48&offset=48&customer_type=regular

    Here is a complete solution to your problem:

    import json
    import requests
    
    items_list = []
    url = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?latitude=22.33523782&longitude=114.18249102&language_id=1&include=characteristics&dynamic_pricing=0&configuration=Variant1&country=hk&customer_id=&customer_hash=&budgets=&cuisine=&sort=&food_characteristic=&use_free_delivery_label=false&opening_type=pickup&vertical=restaurants&limit=48&offset={}&customer_type=regular"
    
    for i in range(5):
        resp = requests.get(
            url.format(i * 48),
            headers={
                "x-disco-client-id": "web",
            },
        )
        if resp.status_code == 200:
            items_list += json.loads(resp.text)["data"]["items"]
        print(f"Finished page: {i}")
    
    print(items_list)
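
The loop above fetches a fixed five pages. Below is a minimal sketch of paging until the results run out. It assumes the endpoint returns an empty or short `items` list on the last page, which is not confirmed here, and that each item carries a `name` field, as the vendor objects in the question do. `fetch_page`, `fetch_all`, `BASE_PARAMS` and `max_pages` are illustrative names, and the standard library's urllib stands in for requests so the sketch has no third-party dependency.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors"
PAGE_SIZE = 48

# Query parameters copied from the URL above; limit and offset are set per call.
BASE_PARAMS = {
    "latitude": "22.33523782",
    "longitude": "114.18249102",
    "language_id": "1",
    "include": "characteristics",
    "dynamic_pricing": "0",
    "configuration": "Variant1",
    "country": "hk",
    "customer_id": "",
    "customer_hash": "",
    "budgets": "",
    "cuisine": "",
    "sort": "",
    "food_characteristic": "",
    "use_free_delivery_label": "false",
    "opening_type": "pickup",
    "vertical": "restaurants",
    "customer_type": "regular",
}


def fetch_page(offset, limit=PAGE_SIZE):
    """Fetch one page of vendor items starting at `offset`."""
    query = urlencode({**BASE_PARAMS, "limit": limit, "offset": offset})
    headers = {
        "x-disco-client-id": "web",   # header used in the answer above
        "User-Agent": "Mozilla/5.0",  # as in the question; may not be required
    }
    req = Request(f"{API_URL}?{query}", headers=headers)
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)["data"]["items"]


def fetch_all(fetch=fetch_page, limit=PAGE_SIZE, max_pages=50):
    """Request pages until one comes back empty or short.

    `max_pages` is a safety cap (an assumption, not an API limit).
    """
    items = []
    for page in range(max_pages):
        batch = fetch(page * limit, limit)
        if not batch:
            break
        items.extend(batch)
        if len(batch) < limit:  # a short page is assumed to be the last one
            break
    return items


if __name__ == "__main__":
    vendors = fetch_all()
    for vendor in vendors:
        # "name" is assumed from the vendor objects shown in the question
        print(vendor.get("name"))
    print(len(vendors))
```

Because `fetch_all` takes the page-fetching function as a parameter, the stopping logic can be checked with a stub that returns canned lists, without calling the live API.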