Tags: python, dataframe, for-loop, iteration, key-value-store

For loop not saving to dataframe


First question here; I hope the way I have written it out makes sense.

I am searching Google for a massive list of emails (I am in Germany, hence the German in the strings). If an email is found, I want to update the email_validity column in the dataframe to reflect it, but the value is not being saved. It prints correctly, but when I check afterwards, the dataframe has not stored the values set during iteration.

#  Script googling emails
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://google.de/search?q="Nicolas Cage"')
pyautogui.press('tab', presses=4)
pyautogui.press('enter')

df['email_validity'] = None

for email, domain_validity, email_validity in zip(df['email'], df['domain_validity'], df['email_validity']):
    if domain_validity == True:
        try:
            driver.get(f'https://google.de/search?q="{email}" after:1990')
            time.sleep(3)     # loading url
            """pyautogui.hotkey('escape', presses=2)"""
            time.sleep(2)
            if 'die alle deine Suchbegriffe enthalten' not in driver.page_source and 'übereinstimmenden Dokumente gefunden' not in driver.page_source and 'Es wurden keine Ergebnisse gefunden' not in driver.page_source:
                email_validity = True
                print(email_validity)
            elif 'not a robot' in driver.page_source:
                print('help me!')
                input("write anything, and press enter:")
            else:
                email_validity = False
                print(email_validity)
        except:
            print(email)
    else:
        email_validity = domain_validity
        
driver.close()
print('completed')

df.head()

Solution

  • The loop never writes back to df. On each iteration, email, domain_validity, and email_validity are local names bound to the values in the tuple produced by zip(). Assigning to them rebinds the names; it does not modify the dataframe.
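
A minimal sketch of the effect, using a made-up two-row frame (the sample data is hypothetical, not the asker's):

```python
import pandas as pd

# Hypothetical sample data standing in for the real email list.
df = pd.DataFrame({"email": ["a@example.com", "b@example.com"]})
df["email_validity"] = None

for email, email_validity in zip(df["email"], df["email_validity"]):
    # This rebinds the local name only; the frame never sees the value.
    email_validity = True

# Every entry in the column is still missing after the loop.
unchanged = df["email_validity"].isna().all()
print(unchanged)
```

The column is still all None afterwards, which matches what the question describes: the print inside the loop looks right, but nothing is stored.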

    df.at

    Write the result back to the dataframe with df.at at the end of each iteration.

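    # .items() yields (row label, value), so index also works with a non-default index
    for index, email in df['email'].items():
        email_validity = None
    
        # the rest of your code (domain_validity is df.at[index, 'domain_validity'])
    
        df.at[index, 'email_validity'] = email_validity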
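
To try the write-back without a browser, here is a small sketch in which a hypothetical looks_valid() function stands in for the Google lookup. The sample rows and the non-default index are assumptions chosen to show that df.at addresses rows by label:

```python
import pandas as pd

def looks_valid(email):
    # Hypothetical stand-in for the Selenium/Google lookup.
    return "@" in email

df = pd.DataFrame(
    {
        "email": ["a@example.com", "not-an-email", "c@example.com"],
        "domain_validity": [True, True, False],
    },
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)
df["email_validity"] = None

# Iterate over the row labels alongside the values so df.at can address each row.
for index, email, domain_validity in zip(df.index, df["email"], df["domain_validity"]):
    if domain_validity:
        email_validity = looks_valid(email)
    else:
        email_validity = False
    df.at[index, "email_validity"] = email_validity

print(df)
```

Zipping df.index with the columns keeps the shape of the original loop while giving each iteration the label it needs for the assignment.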
    

    df.apply()

    You could also extract the email validation check into a separate function and use apply() across the rows instead of writing the loop yourself. The if domain_validity == True: check moves out of the function and into the lambda passed to apply().

    This is not entirely straightforward, because the 'not a robot' case also has to be handled and must return a value.

    def check_email_validity(email):
        try:
            driver.get(f'https://google.de/search?q="{email}" after:1990')
            time.sleep(3)     # loading url
            # pyautogui.hotkey('escape', presses=2)
            time.sleep(2)
            page = driver.page_source
            # Check for the captcha first: a captcha page contains none of the
            # "no results" strings and would otherwise be counted as a hit.
            if 'not a robot' in page:
                print('help me!')
                input("write anything, and press enter:")
                # !!!!!!!!!!! This will need to return something;
                # as written it falls through to return None
            elif 'die alle deine Suchbegriffe enthalten' not in page and 'übereinstimmenden Dokumente gefunden' not in page and 'Es wurden keine Ergebnisse gefunden' not in page:
                return True
            else:
                return False
        except Exception:
            print(email)     # the lookup failed; report which email
    
        return None
    
    df['email_validity'] = df.apply(lambda x: check_email_validity(x['email']) if x['domain_validity'] else False, axis=1)
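
One way to make the three outcomes easier to reason about is to separate the page classification from the browser calls. The sketch below is an illustration under assumptions, not the original code: classify_page(), NO_RESULT_MARKERS, and FAKE_PAGES are hypothetical names, the page sources are invented, and checking the captcha text before the no-results strings is an assumed ordering.

```python
import pandas as pd

# The three "no results" strings from the question.
NO_RESULT_MARKERS = (
    "die alle deine Suchbegriffe enthalten",
    "übereinstimmenden Dokumente gefunden",
    "Es wurden keine Ergebnisse gefunden",
)

def classify_page(page_source):
    """Return True or False for a result page, or None for a captcha page."""
    if "not a robot" in page_source:
        return None  # undecided: the caller should pause and retry
    return not any(marker in page_source for marker in NO_RESULT_MARKERS)

# Invented page sources keyed by email, replacing driver.page_source.
FAKE_PAGES = {
    "found@example.com": "<html>some search results</html>",
    "missing@example.com": "<html>Es wurden keine Ergebnisse gefunden</html>",
    "blocked@example.com": "<html>I'm not a robot</html>",
}

df = pd.DataFrame(
    {
        "email": ["found@example.com", "missing@example.com",
                  "blocked@example.com", "skipped@example.com"],
        "domain_validity": [True, True, True, False],
    }
)

# Rows with an invalid domain are never looked up and get False directly.
df["email_validity"] = df.apply(
    lambda x: classify_page(FAKE_PAGES[x["email"]]) if x["domain_validity"] else False,
    axis=1,
)
print(df)
```

With this split, a None in the column marks the rows that hit the captcha, so they can be filtered out and retried after the captcha has been solved by hand.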