Problems with text extraction in Selenium


I have a problem extracting text in a loop. The text is extracted correctly on the first iteration, but from the second iteration onwards the field I am interested in comes back empty.
The problem affects only the answer field; the question field is extracted every time.

driver = web_driver()
driver.get('https://www.medicitalia.it/consulti/?tag=cefalea')

data = []
# 2. Loop to navigate through the pages
for i in range(20):
    for urls in url:

        url = urls.get_attribute("href")
        print("URL QA: {}".format(url))
        print()

        curr_driver = web_driver()
        curr_driver.get(url)

        # 4. Open the question URLs and extract the data:
        WebDriverWait(curr_driver, 20).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="question"]'))
        )
        

        #5. Extracting and Storing Data:
        try:
            question = curr_driver.find_element(By.XPATH, "//h1[contains(@class, 'consulto-h1') and contains(@class, 'px-2')]").text
        except:
            question = None


        try:
            answer = curr_driver.find_element(By.XPATH, "//div[contains(@class, 'col cons px-4 pt-4 pb-0')]").text
        except:
            answer = None

        # 6. Closing WebDriver Instances:
        curr_driver.quit()



# 6. Closing WebDriver Instances:
driver.quit()

The HTML of the answer element:

<div class="col cons px-4 pt-4 pb-0"> "Cosa posso fare?" <br>
  <br> Una visita gnatologica! <br>
  <br> E' assolutamente verosimile che i sintomi che lei accusa derivano dalla posizione e dal movimento della mandibola. <br> I condili della mandibola si attaccano al cranio proprio vicino all'orecchio, ed ecco perchè vi è confusione dei sintomi attribuendoli falsamente all'apparato uditivo. <br>
  <br> La tensione muscolare spiega invece mal di testa e cervicale, contribuendo inoltre ad aumentare gli acufeni. <p class="my-2 text-right small">
    <i>
      <a href="https://www.medicitalia.it/specialita/gnatologia-clinica/?qurl=http%3A%2F%2Fwww.studioformentelli.it">www.studioformentelli.it</a>
      <br> Attività prevalente: Gnatologia e <br> Implantologia (scuola italiana) </i>
  </p>
</div>

I would like to understand why the answers remain empty from the second iteration of the for loop onwards.


Solution

  • The problem is that you create a new webdriver inside the loop, and that new instance never handles the cookie consent popup.

    Instead of accepting cookies on every iteration, use the same webdriver to crawl all the links.

    Try the following. I first copy the links from the list of elements into a separate list, then crawl those links with the same webdriver in which the cookie popup has already been handled.

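    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    URL = "https://www.medicitalia.it/consulti/?tag=cefalea"
    # Assumption: Chrome is used here; substitute your own web_driver() helper.
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)

    driver.get(URL)
    # Accept the cookie popup once; the same session is reused for every link.
    wait.until(EC.element_to_be_clickable((By.ID, 'pt-accept-all'))).click()
    links = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.titconsulto[href]")))
    # Copy the hrefs into plain strings before navigating away,
    # because the link elements become stale once the page changes.
    urls = [link.get_attribute('href') for link in links]

    for url in urls:
        driver.get(url)
        wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="question"]')))
        try:
            question = driver.find_element(By.XPATH, "//h1[contains(@class, 'consulto-h1') and contains(@class, 'px-2')]").text
        except NoSuchElementException:
            question = None
        # format() avoids a TypeError when the value is None
        print('Q: {}'.format(question))
        try:
            answer = driver.find_element(By.XPATH, "//div[contains(@class, 'col cons px-4 pt-4 pb-0')]").text
        except NoSuchElementException:
            answer = None
        print('\nA: {}'.format(answer))

    driver.quit()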
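
To catch this kind of failure early, it may help to store each result in the `data` list from the question and flag the pages whose answer came back blank. The following is only a minimal sketch, assuming the scraped values are plain strings (or `None`). The helper names `clean_text`, `make_row` and `missing_answers`, as well as the example.com URLs and sample texts, are hypothetical and introduced purely for illustration.

```python
def clean_text(raw):
    """Return the stripped text, or None when the value is missing or blank."""
    if raw is None:
        return None
    stripped = raw.strip()
    return stripped if stripped else None


def make_row(url, question, answer):
    """Build one record for the `data` list, normalizing blank fields to None."""
    return {
        "url": url,
        "question": clean_text(question),
        "answer": clean_text(answer),
    }


def missing_answers(rows):
    """Return the URLs of all records whose answer is None."""
    return [row["url"] for row in rows if row["answer"] is None]


# Hypothetical values standing in for what the scraping loop would extract.
data = [
    make_row("https://example.com/q1", "Question one", "  Answer one \n"),
    make_row("https://example.com/q2", "Question two", ""),
]

# Lists the pages that need a second look (for example, a popup still open).
print(missing_answers(data))
```

Inside the scraping loop, the call would be `data.append(make_row(url, question, answer))`, so an empty answer field shows up as an explicit `None` in the results instead of going unnoticed.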