Problem Introduction
Language version: Python 3.8
Operating system: Windows 10
Any other relevant software: Jupyter Notebook and requests-html
Context: I am following along with this tutorial on parsing websites with requests-html.
Problem statement:
Goal: My goal is to learn more by applying his code to a more difficult website (Stack Overflow, for example). I successfully isolated every 'div' tag using the code below. I now intend to go through every div on Stack Overflow's recent questions page to find the 'question-summary' elements and somehow isolate the question ID.
Expected outcome:
Problem: At 17:29 in the video, he points out that the tag/class he is using a selector on was only used once, and that he would "need to go back to the drawing board" if it were used more than once.
I am trying to search for something relating to either 'id' or 'question-summary-#'. I am not sure what I am searching for, but I know that there will be more than one match. What is the next step?
Example result of current code:
<Element 'div' class=('question-summary',) id='question-summary-64050283'>,
Things I have tried: Current code:
import datetime
import requests
import requests_html
from requests_html import HTML
from importlib import reload
import sys
reload(sys)
now=datetime.datetime.now()
month=now.month
day=now.day
year=now.year
hour=now.hour
minute=now.minute
second=now.second
def url_to_txt(url, filename="world.html", save=False):
    r=requests.get(url)
    if r.status_code == 200:
        html_text=r.text
        if save:
            with open(f"world-{month}-{day}-{year}-{hour}-{minute}-{second}.html", 'w') as f:
                f.write(html_text)
        return html_text
    return ""
url = 'https://stackoverflow.com/questions?tab=newest&page=2'
html_text = url_to_txt(url)
r_html=HTML(html=html_text)
table_class = "div"
r_table = r_html.find(table_class)
print(r_table)
Focusing specifically on getting the question-summary-xxx values from the id attributes, you can try something like this:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://stackoverflow.com/questions?tab=newest&pagesize=50'
r = session.get(url)
targets = r.html.xpath('//div[starts-with(@id,"question-summary-")]/@id')
print(targets)
Output:
['question-summary-64248540',
'question-summary-64248536',
'question-summary-64248535',
'question-summary-64248530',
...]
etc.
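To go from these id strings to the bare question IDs, one option is to strip the fixed prefix. This is only a minimal sketch: it assumes every value has the form question-summary-<digits>, as in the output above, and the helper name extract_question_ids is hypothetical, not something requests-html provides.

```python
import re

# Hypothetical helper, not part of requests-html.
# Assumes ids look like "question-summary-<digits>", as in the output above.
ID_PATTERN = re.compile(r"^question-summary-(\d+)$")

def extract_question_ids(id_values):
    """Return the numeric question IDs found in a list of id attribute values."""
    question_ids = []
    for value in id_values:
        match = ID_PATTERN.match(value)
        if match:  # skip anything that does not follow the expected pattern
            question_ids.append(int(match.group(1)))
    return question_ids

# Sample values copied from the output above
targets = ['question-summary-64248540', 'question-summary-64248536']
print(extract_question_ids(targets))  # prints [64248540, 64248536]
```

In practice you would pass in the targets list returned by the xpath call instead of the hard-coded sample. Values that do not match the pattern are silently skipped, which may or may not be what you want if the page layout changes.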