First question here, so I hope the way I have written this out makes sense.
I am searching a massive list of emails, and if an email is found on Google (I am in Germany, hence the German in the strings), I update the email validity column in the dataframe to reflect it... but it is not saving. The values print correctly, but when I check afterwards, the dataframe has not stored the values set during the iteration.
# Script googling emails
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://google.de/search?q="Nicolas Cage"')
pyautogui.press('tab', presses=4)
pyautogui.press('enter')
df['email_validity'] = None
for email, domain_validity, email_validity in zip(df['email'], df['domain_validity'], df['email_validity']):
    if domain_validity == True:
        try:
            driver.get(f'https://google.de/search?q="{email}" after:1990')
            time.sleep(3)  # loading url
            """pyautogui.hotkey('escape', presses=2)"""
            time.sleep(2)
            if 'die alle deine Suchbegriffe enthalten' not in driver.page_source and 'übereinstimmenden Dokumente gefunden' not in driver.page_source and 'Es wurden keine Ergebnisse gefunden' not in driver.page_source:
                email_validity = True
                print(email_validity)
            elif 'not a robot' in driver.page_source:
                print('help me!')
                input("write anything, and press enter:")
            else:
                email_validity = False
                print(email_validity)
        except:
            print(email)
    else:
        email_validity = domain_validity
driver.close()
print('completed')
df.head()
You haven't updated df in the loop. Your variables email, domain_validity, and email_validity only hold the values from the tuple returned by zip(). Reassigning them does not modify the dataframe.
You need to write the result back to the dataframe with df.at at the end of each iteration.
# enumerate() yields positions 0, 1, 2, ..., so this assumes df has the default integer index
for index, email in enumerate(df['email']):
    email_validity = None
    # the rest of your code
    df.at[index, 'email_validity'] = email_validity
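To make the difference concrete, here is a minimal sketch on a made-up two-row dataframe (the example addresses and values are assumptions, not your data). It first rebinds the loop variable the way the question does, then writes through df.at, using iterrows() so the index label is correct even when the index is not 0, 1, 2, ...

```python
import pandas as pd

# Toy data standing in for the real dataframe (hypothetical values).
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "domain_validity": [True, False],
})
df["email_validity"] = None

# Rebinding the loop variable only changes the local name, not the dataframe.
for email, email_validity in zip(df["email"], df["email_validity"]):
    email_validity = True
unchanged = df["email_validity"].isna().all()

# Writing through df.at with the row's index label does modify the dataframe.
for index, row in df.iterrows():
    df.at[index, "email_validity"] = bool(row["domain_validity"])

result = df["email_validity"].tolist()
```

After the first loop the column is still entirely empty; only the second loop stores values.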
You could also extract your email validation check into a separate function and use apply() on the dataframe instead of looping. You can then drop the if domain_validity == True: check from the function and express it in the lambda passed to apply instead.
That might not be straightforward in your case, since the 'not a robot' case still needs to be handled and has to return a value.
def check_email_validity(email):
    try:
        driver.get(f'https://google.de/search?q="{email}" after:1990')
        time.sleep(3)  # loading url
        """pyautogui.hotkey('escape', presses=2)"""
        time.sleep(2)
        if 'die alle deine Suchbegriffe enthalten' not in driver.page_source and 'übereinstimmenden Dokumente gefunden' not in driver.page_source and 'Es wurden keine Ergebnisse gefunden' not in driver.page_source:
            return True
        elif 'not a robot' in driver.page_source:
            print('help me!')
            input("write anything, and press enter:")
            # !!!!!!!!!!! This will need to return something
        else:
            return False
    except:
        print(email)
        return None

df['email_validity'] = df.apply(lambda x: check_email_validity(x['email']) if x['domain_validity'] else False, axis=1)
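One possible way to make the 'not a robot' branch return something is to retry the same email after a human has solved the captcha. The sketch below is only an illustration under assumptions: classify_page, fetch_page, wait_for_human and max_retries are hypothetical names, and the page fetch is passed in as a function so the logic can be tried without a browser. It also checks for the captcha first, so a captcha page is never counted as a hit.

```python
NO_RESULT_MARKERS = (
    'die alle deine Suchbegriffe enthalten',
    'übereinstimmenden Dokumente gefunden',
    'Es wurden keine Ergebnisse gefunden',
)
CAPTCHA_MARKER = 'not a robot'


def classify_page(page_source):
    """Return True (hits found), False (no hits), or None (captcha page)."""
    # Check for the captcha first, so a captcha page is never counted as a hit.
    if CAPTCHA_MARKER in page_source:
        return None
    return not any(marker in page_source for marker in NO_RESULT_MARKERS)


def check_email_validity(email, fetch_page, wait_for_human=input, max_retries=3):
    """Classify one email, retrying after a human has solved the captcha."""
    for _ in range(max_retries):
        result = classify_page(fetch_page(email))
        if result is not None:
            return result
        wait_for_human("Solve the captcha, then press enter: ")
    return None  # still blocked after max_retries attempts
```

With Selenium, fetch_page could be a small function that calls driver.get() with your search URL, sleeps as before, and returns driver.page_source; the apply() call above would then pass it as the second argument.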