Tags: python-3.x, csv, web-scraping, beautifulsoup, html-parsing

How to get f.write() to put NA's if there is no data in BeautifulSoup?


My goal is to scrape some specific data from multiple profile pages on Khan Academy and put the data into a CSV file.

Here is the code that scrapes one specific profile page and writes it to a CSV:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)  # give the JavaScript-rendered page time to load

soup = BeautifulSoup(r.html.html, 'html.parser')

# Membership date, energy points and videos watched from the statistics table
user_info_table = soup.find('table', class_='user-statistics-table')

dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]

# Discussion statistics (questions, votes, answers, ...) keyed by category name
user_socio_table = soup.find_all('div', class_='discussion-stat')

data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number

# Write everything as a single CSV row
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()

This code works fine with this specific link ('https://www.khanacademy.org/profile/DFletcher1990/').

However, when I change the link to another profile on Khan Academy, for example 'https://www.khanacademy.org/profile/Kkasparas/',

I get this error:

KeyError: 'project help requests'

This is normal, because on the profile "https://www.khanacademy.org/profile/Kkasparas/" there is no "project help requests" value (and no "project help replies" either).

Thus data['project help requests'] and data['project help replies'] do not exist and cannot be written to the CSV file.
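
To illustrate, the dict only ends up containing the statistics that actually appear on the page, so indexing a missing key fails. The keys and values below are just an illustration of what such a profile might yield:

# Illustrative dict built only from the stats present on the page
data = {'questions': '25', 'votes': '100', 'answers': '2', 'flags raised': '0', 'comments': '0', 'tips and thanks': '0'}
data['project help requests']   # raises KeyError: 'project help requests'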

My goal is to run this script on many profile pages, so I would like to know how to put an NA wherever a variable's data is missing, and then write those NA's to the CSV file.

In other words: I would like to make my script work for any kind of user profile page.

Many thanks in advance for your contributions :)


Solution

  • You could define a list of all possible keys and set any key that is not present in data to 'NA' before writing to the file, for example:

    # Every category the CSV expects; anything missing gets 'NA'.
    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
    for header_value in full_data_keys:
        if header_value not in data:
            data[header_value] = 'NA'
    
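    If you prefer, dict.setdefault or dict.get gives the same result without the explicit membership test. This is only a minor variation on the snippet above, not a required change:

    # Only inserts 'NA' when the key is absent; existing values are untouched.
    for header_value in full_data_keys:
        data.setdefault(header_value, 'NA')

    # Or pull each value with a default at write time:
    row = [data.get(key, 'NA') for key in full_data_keys]
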

    Also, a gentle reminder to provide fully working code in your question: user_socio_table was not defined in the question, and I had to look up your previous question to get it.

    The full code would be:

    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
    r.html.render(sleep=5)  # give the JavaScript-rendered page time to load
    soup = BeautifulSoup(r.html.html, 'html.parser')

    # Membership date, energy points and videos watched
    user_info_table = soup.find('table', class_='user-statistics-table')
    dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]

    # Discussion statistics keyed by category name
    data = {}
    user_socio_table = soup.find_all('div', class_='discussion-stat')
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number

    # Fill in 'NA' for any category missing from this profile
    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
    for header_value in full_data_keys:
        if header_value not in data:
            data[header_value] = 'NA'

    # Write the CSV
    filename = "khanscraptry1.csv"
    f = open(filename, "w")
    headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
    f.write(headers)
    f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
    f.close()
    

    Output - khanscraptry1.csv

    date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
    6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0
    
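    As a side note, Python's csv module can do the joining and quoting for you. This is only a sketch, assuming dates, points, videos and data are built exactly as above; note that csv.writer would quote a value like '1,527,829' rather than strip the commas, so the .replace(',', '') is kept here to match the output above.

    import csv

    # Assumes dates, points, videos and the data dict from the code above.
    fieldnames = ['date', 'points', 'videos', 'questions', 'votes', 'answers', 'flags', 'project_request', 'project_replies', 'comments', 'tips_thx']
    with open('khanscraptry1.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(fieldnames)
        writer.writerow([dates, points.replace(',', ''), videos,
                         data['questions'], data['votes'], data['answers'],
                         data['flags raised'], data['project help requests'],
                         data['project help replies'], data['comments'],
                         data['tips and thanks']])
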

    If user_info_table is not present on a profile, change that line to the following:

    if user_info_table is not None:
        dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates = points = videos = 'NA'
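
    Finally, since the goal is to run this on many profile pages, the whole thing can be wrapped in a loop over a list of profile URLs. This is only a sketch: the profile_urls list is a placeholder, and it reuses the same selectors and 'NA' defaulting as above.

    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    # Placeholder list of profiles to scrape -- replace with your own URLs.
    profile_urls = [
        'https://www.khanacademy.org/profile/DFletcher1990/',
        'https://www.khanacademy.org/profile/Kkasparas/',
    ]

    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']

    session = HTMLSession()
    with open('khanscraptry1.csv', 'w') as f:
        f.write("date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n")
        for url in profile_urls:
            r = session.get(url)
            r.html.render(sleep=5)
            soup = BeautifulSoup(r.html.html, 'html.parser')

            # The statistics table may be missing on some profiles
            user_info_table = soup.find('table', class_='user-statistics-table')
            if user_info_table is not None:
                dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
            else:
                dates = points = videos = 'NA'

            # Discussion stats, with 'NA' for anything absent on this profile
            data = {}
            for div in soup.find_all('div', class_='discussion-stat'):
                span = div.find('span')
                data[span.text.strip()] = span.previousSibling.strip()
            for key in full_data_keys:
                data.setdefault(key, 'NA')

            f.write(dates + "," + points.replace(",", "") + "," + videos + "," + ",".join(data[key] for key in full_data_keys) + "\n")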