python-3.x, web-scraping, beautifulsoup, html-parsing

Can’t find a tag that I know is in the document - find_all() returns []


I am using bs4 to scrape https://www.khanacademy.org/profile/DFletcher1990/, a single user profile on Khan Academy.

I am trying to get the User Statistics data (date joined, energy points earned, videos completed).

I have checked https://www.crummy.com/software/BeautifulSoup/bs4/doc/

It seems that this applies: "The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib."

I have tried different parsers but I get the same problem.

from bs4 import BeautifulSoup
import requests

url = 'https://www.khanacademy.org/profile/DFletcher1990/'

res = requests.get(url)

soup = BeautifulSoup(res.content, "lxml")

print(soup.find_all('div', class_='profile-widget-section'))

My code is returning [].


Solution

  • The page content is loaded using JavaScript. The simplest way to check whether content is dynamic is to right-click, view the page source, and check whether the content is present there. You can also turn off JavaScript in your browser and reload the URL. A quick programmatic check is sketched below.
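
    A quick way to confirm this from Python is to check whether a marker you expect (here, the profile-widget-section class from the question) appears in the raw HTML that requests returns. This is just a minimal sketch:

    import requests

    url = 'https://www.khanacademy.org/profile/DFletcher1990/'
    raw_html = requests.get(url).text

    # If the marker is missing from the raw response, the content is rendered
    # client-side by JavaScript rather than being served in the HTML.
    print('profile-widget-section' in raw_html)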

    You can use selenium to get the content:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
    driver.get("https://www.khanacademy.org/profile/DFletcher1990/")

    # Wait up to 20 seconds for the statistics table to be rendered by JavaScript.
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(
        (By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]/div[2]/div/div[2]/table')))

    # Hand the rendered HTML over to BeautifulSoup.
    source = driver.page_source
    soup = BeautifulSoup(source, 'html.parser')

    user_info_table = soup.find('table', class_='user-statistics-table')
    for tr in user_info_table.find_all('tr'):
        tds = tr.find_all('td')
        print(tds[0].text, ":", tds[1].text)
    

    Output:

    Date joined : 4 years ago
    Energy points earned : 932,915
    Videos completed : 372
    

    Another option available (since you are already familiar with requests) is to use requests-html:

    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')

    # render() executes the page's JavaScript in a headless browser;
    # sleep gives the page time to finish loading before the HTML is captured.
    r.html.render(sleep=10)

    soup = BeautifulSoup(r.html.html, 'html.parser')
    user_info_table = soup.find('table', class_='user-statistics-table')
    for tr in user_info_table.find_all('tr'):
        tds = tr.find_all('td')
        print(tds[0].text, ":", tds[1].text)
    

    Output:

    Date joined : 4 years ago
    Energy points earned : 932,915
    Videos completed : 372
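
    Note that the first call to render() downloads a Chromium binary that requests-html uses to execute the JavaScript, so the very first run will take noticeably longer.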
    

    Yet another option would be to find the AJAX request being made, emulate it, and parse the response (which need not always be JSON). In this case, however, the content is not sent to the browser via an AJAX response; it is already present in the page source.

    [Screenshot: the statistics data visible in the raw page source]

    The page simply uses JavaScript to structure this info. You can try to get the data out of that script tag; this might involve some regex to isolate the string and then parsing it as JSON, roughly as sketched below.
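
    A very rough sketch of that idea follows. The marker string countVideosCompleted, the assumption that the data sits in a single {...} blob, and the regex itself are all hypothetical and would need to be adjusted after inspecting the actual page source:

    import json
    import re

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.khanacademy.org/profile/DFletcher1990/'
    soup = BeautifulSoup(requests.get(url).content, 'lxml')

    # Look through the inline <script> tags for one that mentions the statistics data.
    for script in soup.find_all('script'):
        text = script.string or ''
        if 'countVideosCompleted' in text:            # hypothetical marker string
            # Pull out the first {...} blob and attempt to parse it as JSON.
            match = re.search(r'\{.*\}', text, re.DOTALL)
            if match:
                data = json.loads(match.group(0))
                print(data)
            break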