Tags: python, pandas, dataframe, parsing, python-requests-html

Conundrum: failed attempt to add a url column since the url was used to extract the content originally


I followed this tutorial (parse/scrape with Python requests-html) successfully. However, when I went to adjust the code to add a column containing the URL, I realized that the class I was about to use (.question-hyperlink) was already being used to parse the question text itself.

How would you add a url column to this code?

result:

https://i.sstatic.net/xZ4hD.jpg

attempt:

def parse_tagged_page(html):
    question_summaries = html.find(".question-summary")
    key_names = ['question', 'votes', 'tags','summary', 'url']
    classes_needed = ['.question-hyperlink', '.vote', '.tags', '.summary', '.question-hyperlink' ]
    datas = []
    for q_el in question_summaries:
        question_data = {}
        for i, _class in enumerate(classes_needed):
            sub_el = q_el.find(_class, first=True)
            keyname = key_names[i]
            question_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname)
        datas.append(question_data)
    return datas

Solution

  • The URL is contained in the href attribute of the a element, so passing sub_el.text to clean_scraped_data() will not help. You should probably refactor that function:

    def clean_scraped_data(el, keyname=None):
        if keyname == 'votes':
            return el.text.replace('\nvotes', '')
        elif keyname == 'url':
            return f"https://stackoverflow.com{el.attrs['href']}"
        return el.text
    

    The function call should be adjusted accordingly:

    clean_scraped_data(sub_el, keyname=keyname)
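
    For completeness, here is a minimal sketch of how the refactored helper and the adjusted call fit back into the original loop, plus a pandas conversion like the one in the tutorial. The HTMLSession setup, the example tag URL, and the DataFrame step are assumptions based on the tutorial rather than part of the question, and the CSS selectors come from the question's code, so they may no longer match the live site's markup.

    from requests_html import HTMLSession
    import pandas as pd

    def clean_scraped_data(el, keyname=None):
        # Refactored helper from above: receives the element, not its text.
        if keyname == 'votes':
            return el.text.replace('\nvotes', '')
        elif keyname == 'url':
            return f"https://stackoverflow.com{el.attrs['href']}"
        return el.text

    def parse_tagged_page(html):
        question_summaries = html.find(".question-summary")
        key_names = ['question', 'votes', 'tags', 'summary', 'url']
        classes_needed = ['.question-hyperlink', '.vote', '.tags', '.summary', '.question-hyperlink']
        datas = []
        for q_el in question_summaries:
            question_data = {}
            for keyname, _class in zip(key_names, classes_needed):
                sub_el = q_el.find(_class, first=True)
                # Pass the element itself so the 'url' branch can read its href.
                question_data[keyname] = clean_scraped_data(sub_el, keyname=keyname)
            datas.append(question_data)
        return datas

    session = HTMLSession()
    r = session.get("https://stackoverflow.com/questions/tagged/python")  # example tag page
    df = pd.DataFrame(parse_tagged_page(r.html))  # now includes the 'url' column
    print(df.head())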