
Scrape web page <ul> <li> (Python)


Question:

There is a website https://au.pcpartpicker.com/products/cpu/overall-list/#page=1 with a list of <li> elements under a <ul>. Each item in the list contains a <div> with class title, and inside that <div> there are two more <div> elements; the first one has some text, for example 3.4 GHz 6-Core (Pinnacle Ridge). I want to remove all text not in the brackets to get Pinnacle Ridge. After the list is scraped, I want to move on to the next page by changing #page=.

Code:

I'm not too sure and only have snippets, but here it is:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=' + page)

table = r.html.find('ul', first=True)

# not sure: find each <li> and get its first <div>

junk, name = div.split('(')

name = name.replace(')', '')

Expected Result:

I want to loop through each page until there are none left finding each list and getting the name it doesn't need to be saved as i have code to save it when it is created.

If you need any more information please let me know

Thanks


Solution

  • The site is dynamic, so you will have to use selenium to produce the desired results:

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    import sqlite3, time, re

    d = webdriver.Chrome('/path/to/chromedriver')
    d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')
    time.sleep(3)  # let the JavaScript-rendered list load before parsing

    def cpus(_source):
        result = soup(_source, 'html.parser').find('ul', {'id': 'category_content'}).find_all('li')
        _titles = list(filter(None, [(lambda x: '' if x is None else x.text)(i.find('div', {'class': 'title'})) for i in result]))
        data = [list(filter(None, [re.findall(r'(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
        return _titles, [a for *_, [a] in filter(None, data)]

    conn = sqlite3.connect('cpus.db')  # or reuse your existing connection
    _titles, _cpus = cpus(d.page_source)
    conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
    _last_page = soup(d.page_source, 'html.parser').find_all('a', {'href': re.compile(r'#page=\d+')})[-1].text
    for i in range(2, int(_last_page) + 1):
        d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
        time.sleep(3)
        _titles, _cpus = cpus(d.page_source)
        conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
    conn.commit()
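The core of the question is extracting the text inside the parentheses (e.g. Pinnacle Ridge from 3.4 GHz 6-Core (Pinnacle Ridge)). That step can be isolated into a small helper; `family_from_title` is a hypothetical name for illustration, using the same lookbehind/lookahead regex as the answer above:

```python
import re

def family_from_title(title):
    """Return the text inside the last pair of brackets, or None if there are none.
    e.g. '3.4 GHz 6-Core (Pinnacle Ridge)' -> 'Pinnacle Ridge'"""
    matches = re.findall(r'(?<=\().*?(?=\))', title)
    return matches[-1] if matches else None
```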
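The `conn.executemany` call assumes a SQLite table exists; since the asker said they already have save code, the schema below is only a guess at what that table might look like, sketched so the snippet runs end to end:

```python
import sqlite3

# Hypothetical schema matching the INSERT used in the answer;
# the table and column names are assumptions, not from the original post.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS cpu (name TEXT, family TEXT)')
rows = [('Ryzen 5 2600', 'Pinnacle Ridge'), ('Ryzen 7 2700X', 'Pinnacle Ridge')]
conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", rows)
conn.commit()
```

Swapping ':memory:' for a file path persists the data between runs.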