I'm trying to scrape a web application to get the values of a table. How do I scrape the table every time new values are added to it, or otherwise how can I scrape the website? The website is https://play.pakakumi.com/.
My basic code only lets me scrape manually, resulting in many values not being scraped. Also, driver.find_elements_by_xpath does not return anything, but WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located(...)) works.
Below is my code
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
website = "https://play.pakakumi.com/"
path = r'D:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.get(website)
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
'''
k = driver.find_elements_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')
for item in k:
    print(item.text)
'''
foo = WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.XPATH, '/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')))
for b in foo:
    print(b.text)
# print(foo)
Note: The full definitions for all 3 functions are pasted here, and the outputs have been uploaded to this spreadsheet. [Btw, I used CSS selectors because I'm more comfortable with them, but the XPath equivalents are probably not too different.]
You can scrape the link in the first row [as thref below] and then wait until it [the link in the first column of the first row] changes.
# wait = WebDriverWait(driver, maxWait)
# while rowCt < maxRows and tmoCt < max_tmo:...

# parsing the whole table to gather as much data as possible
tSoup = BeautifulSoup(driver.find_element(
    By.CSS_SELECTOR, 'table:has(th.text-center) tbody'
).get_attribute('outerHTML'), 'html.parser')

# get link from first column of first row
thref = tSoup.select_one(
    'tr:first-child>td:first-child>a[href]'
).get('href')

################### scrape rows' data from tSoup ###################

try:
    thref = thref.replace('\\/', '/').replace('/', '\\/')
    thSel = 'table:has(th.text-center) tbody>tr:first-child>'
    thSel += 'td:first-child>a[href^="\/games\/"]'
    wait.until(EC.presence_of_all_elements_located((
        By.CSS_SELECTOR, f'{thSel}:not([href="{thref}"])')))
except: tmoCt += 1  # program ends if tmoCt gets too high
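The "scrape rows' data from tSoup" step isn't pasted above, but a rough sketch of it might look like the snippet below. This is only an assumption about the table's layout (game link in the first column, crash and hash in later columns); the real column order isn't confirmed here.

# hypothetical sketch of the "scrape rows' data from tSoup" step;
# the column layout (game link, crash, hash) is assumed, not verified
tGames = []
for tr in tSoup.select('tr:has(td)'):
    cells = tr.select('td')
    link = cells[0].select_one('a[href]')
    if not link: continue
    gid = link.get('href').replace('\\/', '/').strip('/').split('/')[-1]
    tGames.append({
        'game_id': gid,  # id taken from the /games/<id> link, as a string
        'crash': cells[1].get_text(strip=True) if len(cells) > 1 else None,
        'hash': cells[2].get_text(strip=True) if len(cells) > 2 else None,
    })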
Using this, the first function (scrape_pakakumi_lim) tries to scrape a certain number of rows (maxRows) and then uses pandas to save the scraped data to opfn ("pakakumi.csv" by default).
The main issues are that you have to set maxRows [so you can't just scrape without pre-set limits], and that with too large a maxRows you could end up using too much memory.
scrape_pakakumi [the 3rd and last function] depends on scrape_pakakumi_api [the 2nd function] to gather extra data using the API [which returns a JSON response if everything goes OK]. The API can fail sometimes, especially if too many requests are sent too rapidly; in such cases just the hash and crash from the table are saved, but created_at is left empty and no plays are added.
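scrape_pakakumi_api itself isn't pasted here; a rough sketch of the failure handling described above might look like the following. The endpoint URL (api_url), the timeout, and the returned field names are all assumptions for illustration, not the site's documented API.

import requests

def scrape_pakakumi_api(game_id, api_url):
    # hypothetical sketch only - `api_url` is a placeholder, not the real endpoint
    try:
        resp = requests.get(f'{api_url}/{game_id}', timeout=10)
        resp.raise_for_status()
        gj = resp.json()  # JSON response if everything went OK
        return gj.get('created_at'), gj.get('plays', [])
    except Exception:
        # API failed (e.g. too many requests sent too rapidly):
        # only hash and crash from the table get saved -
        # created_at stays empty and no plays are added
        return None, []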
scrape_pakakumi lets you remove the maxRows limitation by setting it to None (although the default is 999), and it also lets you specify how many new rows you want to wait for (but that must be lower than 39 because the table only has 40 rows). Instead of checking that the first cell no longer contains the same link, it checks whether the link that used to be at the top is now below the nth row [n=wAmt below]. (Don't forget that maxWait should be adjusted to allow enough time for n new rows to load.)
# wait = WebDriverWait(driver, maxWait)
thSel = 'table:has(th.text-center) tbody'
if isinstance(wAmt, int) and 1 < wAmt < 39:
    thSel = f'{thSel}>tr:nth-child({wAmt})~tr>td:first-child'
else: thSel = f'{thSel}>tr:first-child~tr>td:first-child'
wait.until(EC.presence_of_all_elements_located((
    By.CSS_SELECTOR, f'{thSel}>a[href="{thref}"]')))
# except: tmoCt += 1
If wAmt is passed as a float [like 10.0 or 3.5], then the program just sleeps for that many seconds instead of scanning for new rows.
if isinstance(wAmt, float):
    if not gData: # only wait if there's no new data
        time.sleep(wAmt)
    continue # skip rest of loop
# try....except: tmoCt += 1
Both solutions keep track of previously added game_ids and check against them to avoid duplicates.
In solution 1, addedIds is just initialized as an empty list, and then the duplicates are simply filtered out with a list comprehension.
addedIds, games, thref = [], [], '' # initiated outside loop
# and then inside the loop:
tGames = [t for t in tGames if t['game_id'] not in addedIds] # filter out duplicates
games += tGames # add to main list
# [main list (games) saved after loop]
In solution 2, the output file is checked for old data first, and since each game_id is scraped [with the API] individually [in an inner loop], the duplicates are skipped with continue. [The IDs are converted to strings because read_csv extracts them as numbers, and the JSON also has them as numbers, but they're initially extracted as strings from the link.]
maxIds = maxRows if maxRows and 100 < maxRows < 500 else 100 # for trimming
try: # [before loop]
    prevData = pd.read_csv(gfn).to_dict('records') # get data from previous scrape
    addedIds = [str(g['game_id']) for g in prevData if 'game_id' in g][-1*maxIds:]
except: addedIds = []

# and then inside the loop:
addedIds = addedIds[-1*maxIds:] # to reduce memory-usage a bit
# scrape table
for tg in tGames:
    if str(tg['game_id']) in addedIds: continue
    # tgg, tgp = scrape_pakakumi_api....
    # save scraped data
addedIds += [str(g['game_id']) for g in gData]