My goal is to scrape some specific data from multiple profile pages on Khan Academy and write it to a CSV file.
Here is the code that scrapes one specific profile page and writes it to a CSV:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
This code works fine with this specific link ('https://www.khanacademy.org/profile/DFletcher1990/').
However, when I change the link to another Khan Academy profile, for example 'https://www.khanacademy.org/profile/Kkasparas/', I get this error:
KeyError: 'project help requests'
This is normal, because on the profile "https://www.khanacademy.org/profile/Kkasparas/" there is no project help requests value (and no project help replies either). Thus data['project help requests'] and data['project help replies'] don't exist and so can't be written to the CSV file.
My goal is to run this script on many profile pages, so I would like to know how to put an NA in every variable for which no data is available, and then write those NAs to the CSV file.
In other words: I would like to make my script work for any kind of user profile page.
Many thanks in advance for your contributions :)
You could define a new list with all possible headers and set the value of keys that are not present to 'NA', before writing it to the file.
full_data_keys = ['questions', 'votes', 'answers', 'flags raised',
                  'project help requests', 'project help replies',
                  'comments', 'tips and thanks']
for header_value in full_data_keys:
    if header_value not in data:
        data[header_value] = 'NA'
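The same normalisation can also be written with dict.get, which returns a default for missing keys; here is a minimal sketch of the same idea, with a partially filled example dict standing in for the scraped data:

```python
full_data_keys = ['questions', 'votes', 'answers', 'flags raised',
                  'project help requests', 'project help replies',
                  'comments', 'tips and thanks']

# Example of a partially filled dict, as scraped from a sparse profile
data = {'questions': '25', 'votes': '100'}

# dict.get returns 'NA' whenever a key is absent, so no membership test is needed
data = {key: data.get(key, 'NA') for key in full_data_keys}
```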
Also, a gentle reminder to provide fully working code in your question: user_socio_table was not defined in the question, and I had to look up your previous question to find it.
The full code would be:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
data = {}
user_socio_table=soup.find_all('div', class_='discussion-stat')
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
full_data_keys = ['questions', 'votes', 'answers', 'flags raised',
                  'project help requests', 'project help replies',
                  'comments', 'tips and thanks']
for header_value in full_data_keys:
    if header_value not in data:
        data[header_value] = 'NA'
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
Output - khanscraptry1.csv
date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0
Change to the following lines to handle the case where user_info_table is not present:
if user_info_table is not None:
    dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates = points = videos = 'NA'
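To run this over many profiles, you could factor the row-building out into a helper and write each row with the standard csv module, which also handles quoting for you. The sketch below is illustrative: build_row and FIELD_ORDER are my own names, and the sample values are taken from the Kkasparas output above (a profile that lacks the project-help entries).

```python
import csv

FIELD_ORDER = ['questions', 'votes', 'answers', 'flags raised',
               'project help requests', 'project help replies',
               'comments', 'tips and thanks']

def build_row(dates, points, videos, data):
    # Assemble one CSV row; missing discussion stats become 'NA'
    return [dates, points.replace(',', ''), videos] + \
           [data.get(key, 'NA') for key in FIELD_ORDER]

# Example with the Kkasparas stats, which lack the project-help entries
row = build_row('6 years ago', '1,527,829', '1123',
                {'questions': '25', 'votes': '100', 'answers': '2',
                 'flags raised': '0', 'comments': '0',
                 'tips and thanks': '0'})

with open('khanscraptry1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'points', 'videos', 'questions', 'votes',
                     'answers', 'flags', 'project_request',
                     'project_replies', 'comments', 'tips_thx'])
    writer.writerow(row)
```

In a real run you would call build_row once per scraped profile inside a loop over the profile URLs, writing one row each time.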