
Steps for requests-html to parse more than one tag/class in Python


Problem Introduction
Language version: Python 3.8
Operating system: Windows 10
Any other relevant software: Jupyter Notebook and requests-html

Context: I am following along with this tutorial on parsing websites with requests-html.

Problem statement:

Goal: My goal is to learn more by applying the tutorial's code to a more difficult website (Stack Overflow, for example). I successfully selected every 'div' element using the code below. I now want to go through everything labeled div on Stack Overflow's recent questions page, find the 'question-summary' elements, and isolate each question ID.

Expected outcome:

  • I want to isolate each question ID, save the HTML page for that question, and read the saved page for every question on the first 3 pages (150 questions) of the most recently posted questions.

Problem: At 17:29 in the video, the author points out that the tag/class he is using a selector on appears only once on the page, and that he would "need to go back to the drawing board" if it appeared more than once.

I am trying to search for something relating to either 'id' or 'question-summary-#'. I am not sure exactly what I am searching for, but I know that there will be more than one match. What is the next step?

Example result of current code:

<Element 'div' class=('question-summary',) id='question-summary-64050283'>, 

Things I have tried (current code):

import datetime
import requests
from requests_html import HTML

now=datetime.datetime.now()
month=now.month
day=now.day
year=now.year
hour=now.hour
minute=now.minute
second=now.second

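def url_to_txt(url, filename="world.html", save=False):
    # Fetch the page; return its HTML text, or "" on a non-200 response.
    r=requests.get(url)
    if r.status_code == 200:
        html_text=r.text
        if save:
            # The timestamped name is used here; the filename argument is ignored.
            # An explicit encoding avoids UnicodeEncodeError with the Windows default codec.
            with open(f"world-{month}-{day}-{year}-{hour}-{minute}-{second}.html", 'w', encoding='utf-8') as f:
                f.write(html_text)
        return html_text
    return ""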

url = 'https://stackoverflow.com/questions?tab=newest&page=2'

html_text = url_to_txt(url)

r_html=HTML(html=html_text)
# "div" is a bare tag selector, so find() returns a list of every <div> on the page
table_class = "div"
r_table = r_html.find(table_class)

print(r_table)


Solution

  • Focusing specifically on getting the question-summary-xxx values from the id attributes, you can try something like this:

    from requests_html import HTMLSession
    
    session = HTMLSession()
    url = 'https://stackoverflow.com/questions?tab=newest&pagesize=50'
    
    r = session.get(url)
    targets = r.html.xpath('//div[starts-with(@id,"question-summary-")]/@id')
    
    print(targets)
    

    Output:

    ['question-summary-64248540',
     'question-summary-64248536',
     'question-summary-64248535',
     'question-summary-64248530',
    ...]
    
    etc.
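
To get from those id strings to the goal stated in the question (numeric question IDs for the first 3 pages, with each question's page saved), something along these lines might work. This is only a sketch under a few assumptions that are not in the original answer: the helper names (extract_question_ids, question_url, collect_ids, save_question_pages) are made up for illustration, the page and pagesize query parameters are assumed to combine in one URL, and https://stackoverflow.com/questions/<id> is assumed to lead to the question page. The network imports are placed inside the functions that need them so the string-handling helpers can be tried on their own.

```python
import os
import re

# Matches the id values returned by the XPath query, e.g. "question-summary-64248540"
ID_PATTERN = re.compile(r"^question-summary-(\d+)$")


def extract_question_ids(id_attrs):
    """Return the numeric part (as strings) of each matching id attribute value."""
    ids = []
    for value in id_attrs:
        match = ID_PATTERN.match(value)
        if match:
            ids.append(match.group(1))
    return ids


def question_url(question_id):
    # Assumption: a question can be reached by its numeric ID alone.
    return f"https://stackoverflow.com/questions/{question_id}"


def collect_ids(pages=3, pagesize=50):
    """Gather question IDs from the first `pages` pages of newest questions."""
    from requests_html import HTMLSession  # imported lazily; only needed for network use

    session = HTMLSession()
    all_ids = []
    for page in range(1, pages + 1):
        # Assumption: `page` and `pagesize` can be used together in one query string.
        url = f"https://stackoverflow.com/questions?tab=newest&pagesize={pagesize}&page={page}"
        r = session.get(url)
        attrs = r.html.xpath('//div[starts-with(@id,"question-summary-")]/@id')
        all_ids.extend(extract_question_ids(attrs))
    return all_ids


def save_question_pages(question_ids, out_dir="questions"):
    """Download each question page and write it to out_dir/question-<id>.html."""
    import requests  # imported lazily; only needed for network use

    os.makedirs(out_dir, exist_ok=True)
    for qid in question_ids:
        r = requests.get(question_url(qid))
        if r.status_code == 200:
            path = os.path.join(out_dir, f"question-{qid}.html")
            with open(path, "w", encoding="utf-8") as f:
                f.write(r.text)


# Offline check using two values from the output above
sample = ['question-summary-64248540', 'question-summary-64248536']
print(extract_question_ids(sample))  # ['64248540', '64248536']
```

With network access, save_question_pages(collect_ids()) would then attempt to fetch and store the pages. Sending 150 requests in quick succession may be throttled, so a short pause between requests could be worth adding.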