python html python-requests-html

Unable to get entire HTML page with Python requests


I'm working on a Cards Against Humanity card editor. To gather card ideas, I was hoping to programmatically download entire decks from the following web page. Using the browser's inspection tool, I found where the cards are stored:

[Image: Location of card description]

As can be seen, the elements with the whitecards and blackcards classes contain each card's id, which is where the card's phrase or idea is written.

The general functionality of my code is to take a deck URL and obtain all of its cards (white and black). My first approach used the Requests package in Python, with the following code:

import requests
from bs4 import BeautifulSoup

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

root = soup.find(id='root')

However, when I inspected the root object I found it empty, even though it should contain all the whitecards and blackcards elements.


Solution

  • A web page is often incomplete on the initial page load. After the HTML arrives, JavaScript code typically executes one or more AJAX requests and uses the responses to modify the DOM, which is why fetching the page with requests does not yield the final, completed DOM. I loaded the page in my browser and looked at the XHR network requests made after page load, but none seemed to return the missing information, which is a bit puzzling (see the update below). One solution, therefore, is to use Selenium to drive a browser (Chrome in the example below) and scrape the rendered page. You need to wait a second or two after the initial page load to ensure that the DOM is complete:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    
    URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(options=options)
    driver.get(URL)
    time.sleep(1) # wait a second for <div id="root"> to be fully loaded
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    root = soup.find(id='root')
    print(root)
    

    Update

    I looked a little more closely at the AJAX calls and it looks like the following URL will return the actual data you are interested in:

    https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get
    
    import requests
    
    
    URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get'
    resp = requests.get(URL)
    print(resp.json())
    

    Prints:

    {'success': True, 'expansion': {'_id': '5e758e4034489b003f4529f6', 'name': 'Global Pandemic Pack', 'author': '5dfde1f4897a0f003e2fb547', 'description': "Who says in-house quarantine has to suck? For the price of a handful of toilet paper rolls, you can gain some original pandemic-themed cards that'll surely spice up your card games. Get your hands on the first-ever official Cards Lacking Originality card pack now! I mean it, right now!", 'price': 0, 'published': True, 'featured': True, 'dateCreated': '2020-03-21T03:47:12.167Z', '__v': 0, 'gamesUsed': 655, 'whiteCards': ['$1,200 Trump bucks.', 'A free extra week on the cruise ship!', 'A long Zoom meeting with no obvious purpose.', 'A lukewarm bowl of bat soup.', 'A mass panic caused by a sneeze.', 'Babies concieved under quarantine.', 'Beautiful cross-cultural friendships.', 'Binging 30 straight seasons of "The Simpsons."', 'Burying your head in a screen to escape family time.', 'Costco: Battle Royale.', 'Craving any excuse to party.', 'Crying and then sleeping and then crying.', 'Eating all the quarantine food within a day.', 'Ejaculating into the air and trying to catch it in your mouth.', 'Exchanging blowjobs for Kleenex and toilet paper.', 'Forgetting what genuine human connection feels like.', 'Groupons at funeral homes.', 'Hating the media.', 'Insatiable horniness.', 'Kung Flu fighting.', "My Gram-Gram's loooooong vacation!", 'Online class shootings.', 'Only washing hands after the CDC says you have to.', 'Plague, Inc.', 'Praying for the sweet release of death.', 'Raging Ebola.', 'Rediscovering the wonders of video games.', 'Some Lyme disease to go with your Coronavirus.', 'The National Guard.', 'The other eighteen COVIDs.', 'Unnecessarily sensual Zoom messages.'], 'blackCards': ['America: #1 in _______!', "Doctor, I've been doing _______ lately and I fear that I may be very sick.", 'I cannot BELIEVE that the grocery store is sold out of _______ already!', 'We regret to inform you that _______ has officially been cancelled due to COVID-19.', 'What is the one good thing about this pandemic?', 'What was the most difficult thing to give up for social distancing?', "What's really to blame for the spread of the virus?", "What's the best way to kill time while trapped inside the house?", "_______ is the entire reason I'm still holding onto some sanity."]}}
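
Since the goal in the question is to start from a deck URL and end up with the white and black cards, the JSON response can be reduced to two lists. The following is a minimal sketch: the helper names `api_url_from_view_url` and `extract_cards` are made up for illustration, and the `/view` to `/get` mapping is an assumption inferred from the single deck shown here, not a documented API.

```python
from urllib.parse import urlsplit, urlunsplit


def api_url_from_view_url(view_url):
    """Turn a deck's .../view page URL into the matching .../get data URL.

    Assumption: every deck follows the /view -> /get pattern observed
    for the one deck in this answer.
    """
    parts = urlsplit(view_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/view"):
        raise ValueError(f"not a deck view URL: {view_url}")
    # Swap the trailing "view" path segment for "get".
    return urlunsplit(parts._replace(path=path[: -len("view")] + "get"))


def extract_cards(payload):
    """Return (white_cards, black_cards) from the decoded JSON response."""
    if not payload.get("success"):
        raise ValueError("response did not report success")
    expansion = payload["expansion"]
    # Fall back to empty lists in case a deck lacks one of the card types.
    return expansion.get("whiteCards", []), expansion.get("blackCards", [])
```

With the `resp` object from the snippet above, `white, black = extract_cards(resp.json())` would give the two lists of card texts, and `api_url_from_view_url` could build the request URL from the page URL used in the question.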