I am working on a project that scrapes articles from archive websites. For example, below are an archive url and the original url it was created from. I only have the archive url, and I want to use Selenium to extract the original url.
Archive url: https://archive.is/xXAoL
Original url: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
url = "https://archive.is/xXAoL"
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(url)
Any advice on how to get the original url?
Method 1
One option is to use the page's canonical link, which is:
https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
I could strip out everything before the second https. However, that approach is not working for me, so I am looking for another method that does not rely on the meta tags.
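For reference, the string-stripping step of Method 1 could look roughly like the sketch below. It assumes the canonical link has already been read from the page as a plain string; the function name original_from_canonical is made up for illustration and is not part of Selenium or of the archive site.

```python
import re
from typing import Optional


def original_from_canonical(canonical_url: str) -> Optional[str]:
    """Return everything from the second http(s):// onward, or None.

    Hypothetical helper: assumes the canonical link has the shape
    <archive host>/<timestamp>/<original url>, as in the example above.
    """
    # Collect the start index of every scheme marker in the string.
    starts = [m.start() for m in re.finditer(r"https?://", canonical_url)]
    if len(starts) < 2:
        # Only the archive's own scheme was found, so no original url is embedded.
        return None
    # The second marker is where the original url begins.
    return canonical_url[starts[1]:]


canonical = (
    "https://archive.is/2021.09.07-145059/"
    "https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-"
    "vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html"
)
print(original_from_canonical(canonical))
```

Slicing at the second scheme marker, rather than splitting on it, keeps any further http:// that may appear inside the original url's own query string.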
To extract the original url, use WebDriverWait with visibility_of_element_located() to wait until the input element named q is visible, then read its value attribute. You can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
Using XPATH:
driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
Console Output:
https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC