Tags: python, selenium, selenium-chromedriver, screen-scraping

Scraping with Python and Selenium - how should I return a 'null' if element not present


Good day! I am a newbie to Python and Selenium and have searched for a solution for a while now. While some answers come close, I can't seem to find one that solves my problem. The snippet of my code that is causing trouble is as follows:

for url in links:
    driver.get(url)
    company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
    date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
    title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
    urlinf = driver.current_url  # url info

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

While this works when all elements are present (and I can see the output in the Pandas dataframe), if one of the elements doesn't exist (either 'date' or 'title'), Python raises the error:

IndexError: list index out of range

What I have tried thus far:

1) created a try/except (doesn't work)
2) tried if/else (if variable is not "")

I would like to insert "Null" when an element doesn't exist, so that the Pandas dataframe is populated with "Null" for that field.

Any assistance and guidance would be greatly appreciated.

EDIT 1:

I have tried the following:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url  # url info
    except:
        pass

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

and:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url  # url info
    except (NoSuchElementException, ElementNotVisibleException, InvalidSelectorException):
        pass

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

and:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url  # url info
    except:
        i = 'Null'
        pass

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

I tried the same try/except at the point of appending to Pandas.

EDIT 2: The error I get:

IndexError: list index out of range

is attributed to the line:

df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)


Solution

  • As your error shows, you have an IndexError.

    To get around it, add a try/except around the line that actually raises the error.

    Note that find_elements_by_xpath (plural) never raises NoSuchElementException: when nothing matches, it simply returns an empty list. That is why your earlier try/except blocks never fired.

    Also, driver.current_url returns the URL as a plain string. In your inner for loop you index it as if it were a list (urlinf[i]), which gives you single characters - this can be part of the origin of your error.
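
    A quick illustration of both points (a throwaway sketch; the XPath below is a hypothetical selector that matches nothing on the page):

    missing = driver.find_elements_by_xpath("//*[@id='does-not-exist']")
    print(missing)       # [] - an empty list, no exception is raised
    print(len(missing))  # 0 - so a loop over it simply does nothing

    print(driver.current_url[0])  # a single character such as 'h', not a URL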

    In your case try this:

    for url in links:
        driver.get(url)
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url  # url info

        num_page_items = len(date)
        for i in range(num_page_items):
            try:
                df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf}, ignore_index=True)
            except IndexError:
                # i is always valid for date (it drives the loop), so it is
                # company[i] or title[i] that is out of range - fall back to
                # 'Null' for both (see the per-field helper below)
                df = df.append({'Company': 'Null', 'Date': date[i].text, 'Title': 'Null', 'URL': urlinf}, ignore_index=True)
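
    If you would rather blank out only the field that is actually missing, a small helper keeps each lookup safe (a minimal sketch; safe_text is a hypothetical name introduced here for illustration, not part of Selenium or Pandas):

    def safe_text(elements, i):
        # return the element's text, or 'Null' when the index is out of range
        return elements[i].text if i < len(elements) else 'Null'

    for i in range(num_page_items):
        df = df.append({'Company': safe_text(company, i),
                        'Date': safe_text(date, i),
                        'Title': safe_text(title, i),
                        'URL': urlinf}, ignore_index=True)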

    Hope you find this helpful!