There is a website, https://au.pcpartpicker.com/products/cpu/overall-list/#page=1, with a list of <li> items under a <ul>. Each item in the list contains a <div> with class title, and inside that <div> there are two more <div> elements. The first one has text such as 3.4 GHz 6-Core (Pinnacle Ridge); I want to remove all the text not in the brackets so I'm left with Pinnacle Ridge. After the list is scraped I want to move onto the next page by changing #page=.
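To be concrete, this is the kind of transformation I mean (just a rough sketch of my own; the sample string and regex are my illustration, not code that runs against the site):

import re

text = '3.4 GHz 6-Core (Pinnacle Ridge)'    # example text from the first <div> in the title
match = re.search(r'\((.*?)\)', text)       # grab whatever sits between the brackets
family = match.group(1) if match else None
print(family)                               # -> Pinnacle Ridge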
I'm not too sure and only have snippets, but here is what I have:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=' + str(page))
table = r.html.find('ul', first=True)
# not sure how to find each <li> and get its first <div> from here
junk, name = div.text.split('(')
name = name.replace(")", "")   # left with e.g. "Pinnacle Ridge"
I want to loop through each page until there are none left, finding each list and getting the name. It doesn't need to be saved, as I already have code to save it when it is created. Roughly, the flow I have in mind is sketched below.
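This is just pseudocode of my intent (no_more_pages is a placeholder; the stop condition is exactly the part I don't know how to do):

page = 1
while True:
    r = session.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=' + str(page))
    # ... find the <ul>, pull the bracketed name from each <li>, hand it to my save code ...
    if no_more_pages:   # placeholder: how do I tell when the pages run out?
        break
    page += 1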
If you need any more information, please let me know.
Thanks
The site is dynamic, so you will have to use selenium to produce the desired results:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, re
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://au.pcpartpicker.com/products/cpu/overall-list/#page=1')
def cpus(_source):
    result = soup(_source, 'html.parser').find('ul', {'id':'category_content'}).find_all('li')
    _titles = list(filter(None, [(lambda x:'' if x is None else x.text)(i.find('div', {'class':'title'})) for i in result]))
    data = [list(filter(None, [re.findall(r'(?<=\().*?(?=\))', c.text) for c in i.find_all('div')])) for i in result]
    return _titles, [a for *_, [a] in filter(None, data)]

_titles, _cpus = cpus(d.page_source)
# conn is assumed to be your existing database connection (the save code you already have)
conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
_last_page = soup(d.page_source, 'html.parser').find_all('a', {'href':re.compile(r'#page=\d+')})[-1].text
for i in range(2, int(_last_page)+1):
    d.get(f'https://au.pcpartpicker.com/products/cpu/overall-list/#page={i}')
    time.sleep(3)
    _titles, _cpus = cpus(d.page_source)
    conn.executemany("INSERT INTO cpu (name, family) VALUES (?, ?)", list(zip(_titles, _cpus)))
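If the lookbehind/lookahead regex inside cpus is unclear, here is a minimal standalone demonstration of the bracket extraction (the sample strings are made up for illustration):

import re

samples = ['3.4 GHz 6-Core (Pinnacle Ridge)', '3.6 GHz 8-Core (Coffee Lake)']
for s in samples:
    # (?<=\() and (?=\)) match the text between the parentheses without including them
    print(re.findall(r'(?<=\().*?(?=\))', s))   # ['Pinnacle Ridge'], then ['Coffee Lake']

Once the loop finishes you will probably also want to commit the inserts and close the browser, e.g. conn.commit() and d.quit(), assuming conn follows the standard DB-API.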