python html selenium web-scraping tags

How can I scrape information from a web page where all the tags are inside a <pre> (preformatted) tag?


I am writing a Python crawler that scrapes information from the Interpol website. I was able to scrape the first page (names, dates of birth, nationalities, and so on). To scrape the second page, I took its URL from the tag and had my program open it. When the page loaded, I found that all the information (that is, all the tags) was inside a <pre> tag, and I don't understand why. So my question is: how can I get the information inside the <pre> tag, where all the other tags are? I am trying to get people's names, birthdays, their corresponding links, etc. I am using Selenium. Below are the URL of the main website and the URL of the second page that I found in the tag.

Main Website: https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices

The second-page link I found in the tag: https://ws-public.interpol.int/notices/v1/red?resultPerPage=20&page=2

My code so far:

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'

driver = webdriver.Chrome(executable_path="c:\\SeliniumWebDrivers\\chromedriver.exe")
driver.get(url)  # go to the website
urls = []  # URLs of the people
names = []  # names of the people
age = []  # ages of the people
nationality = []  # nationalities of the people
newwindow = []  # all the next-page links
y = 0
g = 1
try:
    driver.get(driver.current_url)

    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'noticesResultsItemList'))
    )

    links = main.find_elements_by_tag_name("a")
    years = main.find_elements_by_class_name("age")
    borns = main.find_elements_by_class_name("nationalities")

    for link in links:
        newurl = link.get_attribute('href')
        urls.append(newurl)
        names.append(link.text)  # adding the names
        y += 1

    for year in years:
        age.append(year.text)  # adding the age to the list

    for nation in borns:
        nationality.append(nation.text)  # adding the nationality to the list

    driver.get(driver.current_url)
    driver.refresh()
    pagination = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'paginationPanel'))
    )
    pages = pagination.find_elements_by_tag_name("a")
    for page in pages:
        newlink = page.get_attribute('href')
        newwindow.append(newlink)

    # to get to the next page
    print(newwindow[2])
    driver.get(newwindow[2])
except (StaleElementReferenceException, NoSuchElementException) as err:
    # Assumed handler: report elements that went stale or were never found
    print(err)

Solution

  • You can use Selenium to click the next-page link instead of fetching its URL. This is just a simple example; you will need a loop that extracts the data and then clicks through to the next page. I've used the variable browser instead of main. I've written a function and used a for loop to get the data from each page.

    import time

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By


    browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
    url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
    browser.get(url)

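    def get_data():
        # Give the page time to load before reading it
        time.sleep(5)
        try:
            # Dismiss the cookie banner if it is showing
            browser.find_element_by_xpath('//*[@id="privacy-cookie-banner__privacy-accept"]').click()
        except (NoSuchElementException, ElementNotInteractableException):
            pass
        links = browser.find_elements_by_tag_name("a")
        years = browser.find_elements_by_class_name("age")
        borns = browser.find_elements_by_class_name("nationalities")
        # Return plain values so they stay usable after the page changes
        return ([link.get_attribute('href') for link in links],
                [year.text for year in years],
                [born.text for born in borns])

    for i in range(1, 9):
        print(i)
        page_links, page_ages, page_nationalities = get_data()
        # li[i + 2] is the pagination entry clicked on iteration i
        next_xpath = '//*[@id="paginationPanel"]/div/div/ul/li[' + str(i + 2) + ']/a'
        print(next_xpath)
        b = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, next_xpath)))
        b.click()
        time.sleep(10)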
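
On the <pre> part of the question: the second-page link points at a ws-public host rather than the main site, so it is probably a web service endpoint and not a normal HTML page. Browsers commonly display a raw JSON response inside a single <pre> element, which would explain what you saw. If that is the case here, you can read the text of the <pre> and parse it instead of searching for tags. The sketch below is only an illustration under that assumption: the key names (_embedded, notices, forename, and so on) and the SAMPLE payload are made up for the example and should be checked against the real response. The helper names page_url and parse_notices are hypothetical too.

```python
import json
from urllib.parse import urlencode

# Base URL and query parameter names are taken from the link in the question.
API_URL = "https://ws-public.interpol.int/notices/v1/red"


def page_url(page, per_page=20):
    """Build the URL for one page of results."""
    return API_URL + "?" + urlencode({"resultPerPage": per_page, "page": page})


def parse_notices(pre_text):
    """Turn the text of the <pre> element into a list of plain dicts.

    ASSUMPTION: the text is JSON shaped like SAMPLE below. Inspect the
    real response and adjust the keys if they differ.
    """
    payload = json.loads(pre_text)
    notices = payload.get("_embedded", {}).get("notices", [])
    rows = []
    for notice in notices:
        rows.append({
            "name": notice.get("name"),
            "forename": notice.get("forename"),
            "date_of_birth": notice.get("date_of_birth"),
            "nationalities": notice.get("nationalities") or [],
            "link": notice.get("_links", {}).get("self", {}).get("href"),
        })
    return rows


# Hypothetical payload, used only to illustrate the assumed shape.
SAMPLE = """
{"_embedded": {"notices": [
  {"name": "DOE", "forename": "JANE", "date_of_birth": "1980/01/31",
   "nationalities": ["XX"],
   "_links": {"self": {"href": "https://example.invalid/notices/1"}}}
]}}
"""

if __name__ == "__main__":
    print(page_url(2))
    for row in parse_notices(SAMPLE):
        print(row)
```

With Selenium you could feed it the page text, for example browser.get(page_url(2)) followed by parse_notices(browser.find_element_by_tag_name("pre").text). If the endpoint really returns JSON, a plain HTTP client may be enough and the browser is not needed for these pages.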