I am using bs4 to scrape one user profile on Khan Academy: https://www.khanacademy.org/profile/DFletcher1990/
I am trying to get the User Statistics data (date joined, energy points earned, videos completed).
I have checked the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
It seems that: "The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib."
I tried different parsers, but I got the same problem each time.
from bs4 import BeautifulSoup
import requests
url = 'https://www.khanacademy.org/profile/DFletcher1990/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.find_all('div', class_='profile-widget-section'))
My code is returning [].
The page content is loaded using JavaScript. The simplest way to check whether content is dynamic is to right-click, view the page source, and see whether the elements you want are present there. You can also turn off JavaScript in your browser and load the URL again.
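The same check can be done from code. Below is a minimal sketch using only the standard library; the two HTML strings are made-up stand-ins for "what requests receives" and "what the browser shows after JavaScript runs", and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class that appears on a real tag in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html, class_name):
    """Return True if some tag in the markup carries class_name."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return class_name in collector.classes


# Made-up sample: the class name only occurs inside a script, not on a tag.
RAW_SOURCE = (
    '<html><body><div id="widget-list"></div>'
    '<script>render("profile-widget-section")</script></body></html>'
)
# Made-up sample: the markup after JavaScript has built the widget.
RENDERED = (
    '<div id="widget-list"><div class="profile-widget-section">'
    '<table class="user-statistics-table"></table></div></div>'
)

print(has_class(RAW_SOURCE, "profile-widget-section"))
print(has_class(RENDERED, "profile-widget-section"))
```

For a real page you would pass requests.get(url).text instead of the sample strings. If the class is missing there, no parser choice will make find_all() return it.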
You can use Selenium to get the rendered content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
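# Note: recent Selenium 4 releases no longer accept executable_path;
# there the driver path is passed through a Service object instead.
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get("https://www.khanacademy.org/profile/DFletcher1990/")
# Wait up to 20 seconds for the statistics table to be added by JavaScript.
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]/div[2]/div/div[2]/table')))
source = driver.page_source
driver.quit()
soup = BeautifulSoup(source, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
# Each row holds a label cell followed by a value cell.
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)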
Output:
Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372
Another option (since you are already familiar with requests) is to use requests-html:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=10)
soup = BeautifulSoup(r.html.html, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
# Each row holds a label cell followed by a value cell.
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)
Output:
Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372
Yet another option would be to find the AJAX request being made, emulate it, and parse the response. That response is not always JSON. In this case, however, the content is not sent to the browser via an AJAX response; it is already present in the page source.
The page simply uses JavaScript to structure this information. You can try to get the data out of that script tag; this will probably involve a regex to locate the string and then parsing it as JSON.
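A rough sketch of that last approach, under loud assumptions: the variable name window.__PROFILE__, the JSON keys, and the sample HTML are all invented for illustration, so inspect the real page source to see what the script tag actually contains. The regex only locates the assignment; json.JSONDecoder.raw_decode then parses the object that starts there, which avoids writing a regex that has to match balanced braces.

```python
import json
import re

# Invented sample page: the data sits in a script tag, not in HTML elements.
SAMPLE_HTML = """
<html><body>
<div id="widget-list"></div>
<script>
  window.__PROFILE__ = {"stats": {"points": 932915, "videos": 372}};
  initPage();
</script>
</body></html>
"""


def extract_embedded_json(html, variable):
    """Parse the JSON value assigned to `variable` inside the page source."""
    # Find "<variable> = " and consume the whitespace around the equals sign,
    # because raw_decode expects the JSON text to start exactly at the index.
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{variable!r} not found in page source")
    # raw_decode stops at the end of the first complete JSON value,
    # so the trailing JavaScript after it is ignored.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


profile = extract_embedded_json(SAMPLE_HTML, "window.__PROFILE__")
print(profile["stats"]["points"], profile["stats"]["videos"])
```

With the real page you would pass res.text from the requests call in the question. This only works if the embedded value is strict JSON; a JavaScript object literal with unquoted keys or single quotes would need extra cleanup first.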