I am creating a Python crawler that scrapes information from the Interpol website. I was able to scrape information from the first page, such as names of people, date of birth, nationality, etc. To scrape information from the second page, I first got the URL from the tag and opened that link with my program. When I went to the URL, I found that all the information (meaning all the tags) was inside a `<pre>` tag. I am confused about why that is the case. So my question is: how can I get the information from inside the `<pre>` section where all the other tags are? I am trying to get names of people, birthdays, their corresponding links, etc. I am using Selenium, by the way. I will put down the URL of the main website and the URL of the second page that I found in the tag. I hope that helps you understand what I am talking about.
Main Website: https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices
The second-page link I found in the tag: https://ws-public.interpol.int/notices/v1/red?resultPerPage=20&page=2
The code I have so far for this problem is posted below:
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
driver = webdriver.Chrome(executable_path="c:\\SeliniumWebDrivers\\chromedriver.exe")
driver.get(url)  # to go to the website

url = []  # to get all the URLs of the people
names = []  # to get the names of the people
age = []  # to get the age of the people
nationality = []  # to get the nationality of the people
newwindow = []  # to get all the next-page links
y = 0
g = 1

try:
    driver.get(driver.current_url)
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'noticesResultsItemList'))
    )
    links = main.find_elements_by_tag_name("a")
    years = main.find_elements_by_class_name("age")
    borns = main.find_elements_by_class_name("nationalities")
    for link in links:
        newurl = link.get_attribute('href')
        url.append(newurl)
        names.append(link.text)  # adding the names
        y += 1
    for year in years:
        age.append(year.text)  # adding the age to the list
    for nation in borns:
        nationality.append(nation.text)  # adding the nationality to the list
    driver.get(driver.current_url)
    driver.refresh()
    next = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'paginationPanel'))
    )
    pages = next.find_elements_by_tag_name("a")
    for page in pages:
        newlink = page.get_attribute('href')
        newwindow.append(newlink)
    # to get to the next page
    print(newwindow[2])
    driver.get(newwindow[2])
except (StaleElementReferenceException, NoSuchElementException) as e:
    print(e)
You can use Selenium to click the next page instead of getting the URL. This is just a simple example; you may need to use a loop to extract the data and then click the next page. I've used the variable `browser` instead of `main`. I've written a function and used a for loop to get the data from each page:
from selenium import webdriver
import time
from selenium.common.exceptions import NoSuchElementException,ElementNotInteractableException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
browser.get(url)

def get_data():
    links = browser.find_elements_by_tag_name("a")
    years = browser.find_elements_by_class_name("age")
    borns = browser.find_elements_by_class_name("nationalities")
    # extract whatever you need from links/years/borns here

time.sleep(5)
try:
    # dismiss the cookie banner if it is shown
    browser.find_element_by_xpath('//*[@id="privacy-cookie-banner__privacy-accept"]').click()
except ElementNotInteractableException:
    pass

for i in range(1, 9):
    print(i)
    get_data()
    print('//*[@id="paginationPanel"]/div/div/ul/li[' + str(i + 2) + ']/a')
    b = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="paginationPanel"]/div/div/ul/li[' + str(i + 2) + ']/a'))
    )
    b.click()
    time.sleep(10)
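As a side note on the `<pre>` question: the second-page link you found (https://ws-public.interpol.int/notices/v1/red?resultPerPage=20&page=2) is a JSON API endpoint, and the `<pre>` tag is just how the browser renders raw JSON. So instead of driving a browser to that URL, you could fetch it (e.g. with `requests.get(url).json()`) and parse the payload directly. A minimal sketch of the parsing step — the field names (`_embedded`, `notices`, `forename`, `name`, `date_of_birth`, `nationalities`, `_links`) are my assumption about the payload shape, so check them against the actual response:

```python
import json

def parse_notices(payload):
    """Pull name, birth date, nationalities and detail link out of one
    page of the red-notices API response (assumed payload shape)."""
    people = []
    for notice in payload.get("_embedded", {}).get("notices", []):
        people.append({
            "name": f'{notice.get("forename", "")} {notice.get("name", "")}'.strip(),
            "date_of_birth": notice.get("date_of_birth"),
            "nationalities": notice.get("nationalities", []),
            "link": notice.get("_links", {}).get("self", {}).get("href"),
        })
    return people

# Illustrative payload mimicking the assumed response shape; the real
# dict would come from e.g. requests.get(url).json().
sample = json.loads("""
{"_embedded": {"notices": [
  {"forename": "JOHN", "name": "DOE", "date_of_birth": "1980/01/01",
   "nationalities": ["US"],
   "_links": {"self": {"href": "https://ws-public.interpol.int/notices/v1/red/1980-12345"}}}
]}}
""")

for person in parse_notices(sample):
    print(person["name"], person["date_of_birth"], person["link"])
```

You could then page through results by incrementing the `page` query parameter, which is exactly what the link you found is doing.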