
Scrape web page <ul> <li> (Python)


Question:

There is a website https://au.pcpartpicker.com/products/cpu/overall-list/#page=1 with a list of <li> elements under a <ul>. Each item in the list contains a <div> with class title, and inside that <div> there are two more <div> elements; the first one has some text, for example 3.4 GHz 6-Core (Pinnacle Ridge). I want to remove all text not in the brackets to get Pinnacle Ridge. After the list is scraped, I want to move on to the next page by changing #page=.

Code:

I'm not too sure and only have snippets, but here it is:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=' + page)

table = r.html.find('ul', first=True)

# not sure: find each <li> and get its first <div>

junk, name = div.split('(')

name = name.replace(')', '')

Expected Result:

I want to loop through each page until there are none left finding each list and getting the name it doesn't need to be saved as i have code to save it when it is created.

If you need any more information please let me know

Thanks


Solution

  • The site is dynamic, so you will have to use selenium to produce the desired results:

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    import sqlite3, time, re

    d = webdriver.Chrome('/path/to/chromedriver')
    d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')
    time.sleep(3)  # let the JavaScript-rendered list load before parsing

    def cpus(_source):
        result = soup(_source, 'html.parser').find('ul', {'id': 'category_content'}).find_all('li')
        _titles = list(filter(None, [(lambda x: '' if x is None else x.text)(i.find('div', {'class': 'title'})) for i in result]))
        data = [list(filter(None, [re.findall(r'(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
        return _titles, [a for *_, [a] in filter(None, data)]

    conn = sqlite3.connect('cpus.db')  # or reuse your existing connection
    _titles, _cpus = cpus(d.page_source)
    conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
    _last_page = soup(d.page_source, 'html.parser').find_all('a', {'href': re.compile(r'#page=\d+')})[-1].text
    for i in range(2, int(_last_page) + 1):
        d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
        time.sleep(3)
        _titles, _cpus = cpus(d.page_source)
        conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
    conn.commit()
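The core of the question is extracting the text inside the parentheses (e.g. Pinnacle Ridge from 3.4 GHz 6-Core (Pinnacle Ridge)). That step can be isolated into a small helper; `family_from_title` is a hypothetical name for illustration, using the same lookbehind/lookahead regex as the answer above:

```python
import re

def family_from_title(title):
    """Return the text inside the last pair of brackets, or None if there are none.
    e.g. '3.4 GHz 6-Core (Pinnacle Ridge)' -> 'Pinnacle Ridge'"""
    matches = re.findall(r'(?<=\().*?(?=\))', title)
    return matches[-1] if matches else None
```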
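The `conn.executemany` call assumes a SQLite table exists; since the asker said they already have save code, the schema below is only a guess at what that table might look like, sketched so the snippet runs end to end:

```python
import sqlite3

# Hypothetical schema matching the INSERT used in the answer;
# the table and column names are assumptions, not from the original post.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS cpu (name TEXT, family TEXT)')
rows = [('Ryzen 5 2600', 'Pinnacle Ridge'), ('Ryzen 7 2700X', 'Pinnacle Ridge')]
conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", rows)
conn.commit()
```

Swapping ':memory:' for a file path persists the data between runs.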